How to select columns whose names are dates using dplyr in R

I have a dataframe where some column headers are character names and others are dates. I want to select all the columns that don't have dates as headers, together with all the date-named columns whose dates are earlier than the current/system date (Sys.Date()). How can I do this with a select statement in dplyr?
Below is the dataframe
> dput(job_times[1:5,])
structure(list(Skill = c("KAC", "KAC", "KAC", "KAC", "KAC"),
Patch = c("A1", "A2", "A3", "A4", "A5"), `Work Code` = c("W01",
"W01", "W01", "W01", "W01"), Product = c("KAC Repair", "KAC Repair",
"KAC Repair", "KAC Repair", "KAC Repair"), `Visit Time` = c(45.68,
42.55, 46.45, 51.86, 43.49), Travel = c(32.5, 21.66, 26.33,
28.63, 27.03), Success = c(0.69, 0.66, 0.67, 0.65, 0.67),
`Completion Time` = c(1.9, 1.61, 1.8, 2.05, 1.74), `28-12-2020` = c(1.9,
1.61, 1.8, 2.05, 1.74), `04-01-2021` = c(1.9, 1.61, 1.8,
2.05, 1.74), `11-01-2021` = c(1.9, 1.61, 1.8, 2.05, 1.74),
`18-01-2021` = c(1.9, 1.61, 1.8, 2.05, 1.74), `25-01-2021` = c(1.9,
1.61, 1.8, 2.05, 1.74), `01-02-2021` = c(1.9, 1.61, 1.8,
2.05, 1.74), `08-02-2021` = c(1.9, 1.61, 1.8, 2.05, 1.74),
`15-02-2021` = c(1.9, 1.61, 1.8, 2.05, 1.74), `22-02-2021` = c(1.9,
1.61, 1.8, 2.05, 1.74), `01-03-2021` = c(1.9, 1.61, 1.8,
2.05, 1.74), `08-03-2021` = c(1.9, 1.61, 1.8, 2.05, 1.74),
`15-03-2021` = c(1.9, 1.61, 1.8, 2.05, 1.74), `22-03-2021` = c(1.9,
1.61, 1.8, 2.05, 1.74), `29-03-2021` = c(1.9, 1.61, 1.8,
2.05, 1.74), `05-04-2021` = c(1.9, 1.61, 1.8, 2.05, 1.74),
`12-04-2021` = c(1.9, 1.61, 1.8, 2.05, 1.74), `19-04-2021` = c(1.9,
1.61, 1.8, 2.05, 1.74), `26-04-2021` = c(1.9, 1.61, 1.8,
2.05, 1.74), `03-05-2021` = c(1.9, 1.61, 1.8, 2.05, 1.74),
`10-05-2021` = c(1.9, 1.61, 1.8, 2.05, 1.74), `17-05-2021` = c(1.9,
1.61, 1.8, 2.05, 1.74), `24-05-2021` = c(1.9, 1.61, 1.8,
2.05, 1.74), `31-05-2021` = c(1.9, 1.61, 1.8, 2.05, 1.74),
`07-06-2021` = c(1.9, 1.61, 1.8, 2.05, 1.74), `14-06-2021` = c(1.9,
1.61, 1.8, 2.05, 1.74), `21-06-2021` = c(1.9, 1.61, 1.8,
2.05, 1.74), `28-06-2021` = c(1.9, 1.61, 1.8, 2.05, 1.74),
`05-07-2021` = c(1.9, 1.61, 1.8, 2.05, 1.74), `12-07-2021` = c(1.9,
1.61, 1.8, 2.05, 1.74), `19-07-2021` = c(1.9, 1.61, 1.8,
2.05, 1.74), `26-07-2021` = c(1.9, 1.61, 1.8, 2.05, 1.74),
`02-08-2021` = c(1.9, 1.61, 1.8, 2.05, 1.74), `09-08-2021` = c(1.9,
1.61, 1.8, 2.05, 1.74), `16-08-2021` = c(1.9, 1.61, 1.8,
2.05, 1.74), `23-08-2021` = c(1.9, 1.61, 1.8, 2.05, 1.74),
`30-08-2021` = c(1.9, 1.61, 1.8, 2.05, 1.74), `06-09-2021` = c(1.9,
1.61, 1.8, 2.05, 1.74), `13-09-2021` = c(1.9, 1.61, 1.8,
2.05, 1.74), `20-09-2021` = c(1.9, 1.61, 1.8, 2.05, 1.74),
`27-09-2021` = c(1.9, 1.61, 1.8, 2.05, 1.74), `04-10-2021` = c(1.9,
1.61, 1.8, 2.05, 1.74), `11-10-2021` = c(1.9, 1.61, 1.8,
2.05, 1.74), `18-10-2021` = c(1.9, 1.61, 1.8, 2.05, 1.74),
`25-10-2021` = c(1.9, 1.61, 1.8, 2.05, 1.74), `01-11-2021` = c(1.9,
1.61, 1.8, 2.05, 1.74), `08-11-2021` = c(1.9, 1.61, 1.8,
2.05, 1.74), `15-11-2021` = c(1.9, 1.61, 1.8, 2.05, 1.74),
`22-11-2021` = c(1.9, 1.61, 1.8, 2.05, 1.74), `29-11-2021` = c(1.9,
1.61, 1.8, 2.05, 1.74), `06-12-2021` = c(1.9, 1.61, 1.8,
2.05, 1.74), `13-12-2021` = c(1.9, 1.61, 1.8, 2.05, 1.74),
`20-12-2021` = c(1.9, 1.61, 1.8, 2.05, 1.74), `27-12-2021` = c(1.9,
1.61, 1.8, 2.05, 1.74), `03-01-2022` = c(1.9, 1.61, 1.8,
2.05, 1.74), `10-01-2022` = c(1.9, 1.61, 1.8, 2.05, 1.74),
`17-01-2022` = c(1.9, 1.61, 1.8, 2.05, 1.74), `24-01-2022` = c(1.9,
1.61, 1.8, 2.05, 1.74), `31-01-2022` = c(1.9, 1.61, 1.8,
2.05, 1.74), `07-02-2022` = c(1.9, 1.61, 1.8, 2.05, 1.74),
`14-02-2022` = c(1.9, 1.61, 1.8, 2.05, 1.74), `21-02-2022` = c(1.9,
1.61, 1.8, 2.05, 1.74), `28-02-2022` = c(1.9, 1.61, 1.8,
2.05, 1.74), `07-03-2022` = c(1.9, 1.61, 1.8, 2.05, 1.74),
`14-03-2022` = c(1.9, 1.61, 1.8, 2.05, 1.74), `21-03-2022` = c(1.9,
1.61, 1.8, 2.05, 1.74), `28-03-2022` = c(1.9, 1.61, 1.8,
2.05, 1.74), `04-04-2022` = c(1.9, 1.61, 1.8, 2.05, 1.74),
`11-04-2022` = c(1.9, 1.61, 1.8, 2.05, 1.74), `18-04-2022` = c(1.9,
1.61, 1.8, 2.05, 1.74), `25-04-2022` = c(1.9, 1.61, 1.8,
2.05, 1.74), `02-05-2022` = c(1.9, 1.61, 1.8, 2.05, 1.74),
`09-05-2022` = c(1.9, 1.61, 1.8, 2.05, 1.74), `16-05-2022` = c(1.9,
1.61, 1.8, 2.05, 1.74), `23-05-2022` = c(1.9, 1.61, 1.8,
2.05, 1.74), `30-05-2022` = c(1.9, 1.61, 1.8, 2.05, 1.74),
`06-06-2022` = c(1.9, 1.61, 1.8, 2.05, 1.74), `13-06-2022` = c(1.9,
1.61, 1.8, 2.05, 1.74), `20-06-2022` = c(1.9, 1.61, 1.8,
2.05, 1.74), `27-06-2022` = c(1.9, 1.61, 1.8, 2.05, 1.74),
`04-07-2022` = c(1.9, 1.61, 1.8, 2.05, 1.74), `11-07-2022` = c(1.9,
1.61, 1.8, 2.05, 1.74), `18-07-2022` = c(1.9, 1.61, 1.8,
2.05, 1.74), `25-07-2022` = c(1.9, 1.61, 1.8, 2.05, 1.74),
`01-08-2022` = c(1.9, 1.61, 1.8, 2.05, 1.74), `08-08-2022` = c(1.9,
1.61, 1.8, 2.05, 1.74), `15-08-2022` = c(1.9, 1.61, 1.8,
2.05, 1.74), `22-08-2022` = c(1.9, 1.61, 1.8, 2.05, 1.74),
`29-08-2022` = c(1.9, 1.61, 1.8, 2.05, 1.74), `05-09-2022` = c(1.9,
1.61, 1.8, 2.05, 1.74), `12-09-2022` = c(1.9, 1.61, 1.8,
2.05, 1.74), `19-09-2022` = c(1.9, 1.61, 1.8, 2.05, 1.74),
`26-09-2022` = c(1.9, 1.61, 1.8, 2.05, 1.74), `03-10-2022` = c(1.9,
1.61, 1.8, 2.05, 1.74), `10-10-2022` = c(1.9, 1.61, 1.8,
2.05, 1.74), `17-10-2022` = c(1.9, 1.61, 1.8, 2.05, 1.74),
`24-10-2022` = c(1.9, 1.61, 1.8, 2.05, 1.74), `31-10-2022` = c(1.9,
1.61, 1.8, 2.05, 1.74), `07-11-2022` = c(1.9, 1.61, 1.8,
2.05, 1.74), `14-11-2022` = c(1.9, 1.61, 1.8, 2.05, 1.74),
`21-11-2022` = c(1.9, 1.61, 1.8, 2.05, 1.74), `28-11-2022` = c(1.9,
1.61, 1.8, 2.05, 1.74), `05-12-2022` = c(1.9, 1.61, 1.8,
2.05, 1.74), `12-12-2022` = c(1.9, 1.61, 1.8, 2.05, 1.74),
`19-12-2022` = c(1.9, 1.61, 1.8, 2.05, 1.74), `26-12-2022` = c(1.9,
1.61, 1.8, 2.05, 1.74)), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
I want the Skill, Patch, Work Code, Product, Visit Time, Travel, Success, and Completion Time columns, along with all the columns whose header dates are less than or equal to Sys.Date(), using dplyr and select.

This is how I would solve it:
cols <- grep('\\d{2}-\\d{2}-\\d{2}', names(job_times), value = TRUE)
result <- job_times[, c(setdiff(names(job_times), cols),
                        cols[Sys.Date() > as.Date(cols, '%d-%m-%Y')])]
You can integrate this in a dplyr pipe as:
library(dplyr)

job_times %>%
  select({
    cols <- grep('\\d{2}-\\d{2}-\\d{2}', names(.), value = TRUE)
    c(setdiff(names(.), cols),
      cols[Sys.Date() > as.Date(cols, '%d-%m-%Y')])
  })
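If you prefer to avoid the magrittr dot inside select(), a minimal sketch (assuming dplyr 1.0+, where all_of() is available) computes the future-dated columns first and drops them:

date_cols <- grep('\\d{2}-\\d{2}-\\d{4}', names(job_times), value = TRUE)
future_cols <- date_cols[as.Date(date_cols, '%d-%m-%Y') > Sys.Date()]
job_times %>% select(-all_of(future_cols))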

I would suggest creating a helper function; then you can use select like this:
library(tidyverse)
library(lubridate)

is_before_today <- function(x) {
  (dmy(x, quiet = TRUE) < Sys.Date()) %>% coalesce(FALSE)
}

df %>%
  select(
    matches("^\\D"), all_of(colnames(.) %>% keep(is_before_today))
  )
#> # A tibble: 5 x 38
#> Skill Patch `Work Code` Product `Visit Time` Travel Success `Completion Tim~
#> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 KAC A1 W01 KAC Repa~ 45.7 32.5 0.69 1.9
#> 2 KAC A2 W01 KAC Repa~ 42.6 21.7 0.66 1.61
#> 3 KAC A3 W01 KAC Repa~ 46.4 26.3 0.67 1.8
#> 4 KAC A4 W01 KAC Repa~ 51.9 28.6 0.65 2.05
#> 5 KAC A5 W01 KAC Repa~ 43.5 27.0 0.67 1.74
#> # ... with 30 more variables: 28-12-2020 <dbl>, 04-01-2021 <dbl>,
#> # 11-01-2021 <dbl>, 18-01-2021 <dbl>, 25-01-2021 <dbl>, 01-02-2021 <dbl>,
#> # 08-02-2021 <dbl>, 15-02-2021 <dbl>, 22-02-2021 <dbl>, 01-03-2021 <dbl>,
#> # 08-03-2021 <dbl>, 15-03-2021 <dbl>, 22-03-2021 <dbl>, 29-03-2021 <dbl>,
#> # 05-04-2021 <dbl>, 12-04-2021 <dbl>, 19-04-2021 <dbl>, 26-04-2021 <dbl>,
#> # 03-05-2021 <dbl>, 10-05-2021 <dbl>, 17-05-2021 <dbl>, 24-05-2021 <dbl>,
#> # 31-05-2021 <dbl>, 07-06-2021 <dbl>, 14-06-2021 <dbl>, 21-06-2021 <dbl>,
#> # 28-06-2021 <dbl>, 05-07-2021 <dbl>, 12-07-2021 <dbl>, 19-07-2021 <dbl>
Created on 2021-07-20 by the reprex package (v1.0.0)

Ronak Shah's answer is great. Here is how I would do it:
## Get the list of all column names
ColumnNames <- names(TestDF)
## Retain only the names that contain letters, i.e. the non-date columns
CharacterColumnNames <- ColumnNames[grepl("[[:alpha:]]", ColumnNames)]
## The remaining names are the date columns
DateColumns <- setdiff(ColumnNames, CharacterColumnNames)
## Keep only the date columns strictly before today
RequiredDateColumns <- DateColumns[Sys.Date() > as.Date(DateColumns, '%d-%m-%Y')]
## Build the modified data frame
ModifiedDF <- TestDF[, c(CharacterColumnNames, RequiredDateColumns)]
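Whichever of the answers above you use, a quick sanity check is to confirm that no future-dated columns survived (a sketch, assuming the dd-mm-yyyy headers shown in the question):

kept_dates <- grep('\\d{2}-\\d{2}-\\d{4}', names(ModifiedDF), value = TRUE)
any(as.Date(kept_dates, '%d-%m-%Y') > Sys.Date())  # should be FALSE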

Related

Canonical Correlation in R with different matrix dimensions

I'm having difficulty doing a CCA in R.
The assignment I'm working on is from "Applied Multivariate Analysis" by Sharma, exercise 13.7, if you're familiar with it.
Basically, I'm asked to conduct a CCA on a set of variables. There are seven X variables but only five Y variables, so R complains that the dimensions are not compatible. The data, called CETNEW, is shown below.
Edited (Changed from image to dput):
structure(list(`...1` = c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "Y1", "Y2", "Y3", "Y4", "Y5"),
`...2` = c(2.72, 1.2, 0.82, 0.92, 1.19, 1, 1.45, 0.68, 0.98, 0.57, 1.07, 0.91),
`...3` = c(1.2, 3.78, 0.7, 1.04, 1.06, 1.32, 1.31, 0.56, 1, 0.79, 1.13, 1.38),
`...4` = c(0.82, 0.7, 1.7, 0.59, 0.83, 1.08, 1.01, 0.65, 0.78, 0.66, 0.93, 0.77),
`...5` = c(0.92, 1.04, 0.59, 3.09, 1.06, 0.93, 1.47, 0.62, 1.26, 0.51, 0.94, 0.85),
`...6` = c(1.19, 1.06, 0.83, 1.06, 2.94, 1.36, 1.66, 0.68, 1.16, 0.77, 1.37, 1.11),
`...7` = c(1, 1.32, 1.08, 0.93, 1.36, 2.94, 1.56, 0.9, 1.23, 0.78, 1.65, 1.31),
`...8` = c(1.45, 1.31, 1.01, 1.47, 1.66, 1.56, 3.11, 1.03, 1.7, 0.81, 1.63, 1.44),
`...9` = c(0.68, 0.56, 0.65, 0.62, 0.68, 0.9, 1.03, 1.71, 0.99, 0.65, 0.86, 0.72),
`...10` = c(0.98, 1, 0.78, 1.26, 1.16, 1.23, 1.7, 0.99, 3.07, 0.61, 1.43, 1.28),
`...11` = c(0.57, 0.79, 0.66, 0.51, 0.77, 0.78, 0.81, 0.65, 0.61, 2.83, 1.04, 0.84),
`...12` = c(1.07, 1.13, 0.93, 0.94, 1.37, 1.65, 1.63, 0.86, 1.43, 1.04, 2.83, 1.6),
`...13` = c(0.91, 1.38, 0.77, 0.85, 1.11, 1.31, 1.44, 0.72, 1.28, 0.84, 1.6, 4.01)),
row.names = c(NA, -12L), class = c("tbl_df", "tbl", "data.frame"))
What I've Done so Far
CETNEW <- CETNEW[, -1] # remove the non-numeric first column
Then I created two sets of variables (criterion and predictor):
CETNEWx <- CETNEW[1:7, ]
CETNEWy <- CETNEW[8:12, ]
Then I've been using various packages such as CCA, CCP and candisc. From CCA:
ccCETNEW <- cc(CETNEWx,CETNEWy)
Yields the following error message:
Error in cov(X, Y, use = "pairwise") : incompatible dimensions
The matcor function also from CCA, yields the following error message:
Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 7, 5
Thus, it seems it all boils down to the dimension mismatch. I've talked to my professor about it, but since he uses SAS, which apparently accepts this data without complaint, he could not help me.
Please, if you're familiar with canonical correlation and have had a similar problem before, any help regarding this topic is highly appreciated.
If you look at your data, notice the first column is divided into X and Y labels. That suggests to me that your data are transposed. If so, each column is an observation and the X and Y labels indicate various measurements taken on each observation. Canonical correlations are performed on two groups of measurements/variables from a single set of observations. First, here is the transposed data:
CETNEW.T <- structure(list(X1 = c(2.72, 1.2, 0.82, 0.92, 1.19, 1, 1.45, 0.68,
0.98, 0.57, 1.07, 0.91), X2 = c(1.2, 3.78, 0.7, 1.04, 1.06, 1.32,
1.31, 0.56, 1, 0.79, 1.13, 1.38), X3 = c(0.82, 0.7, 1.7, 0.59,
0.83, 1.08, 1.01, 0.65, 0.78, 0.66, 0.93, 0.77), X4 = c(0.92,
1.04, 0.59, 3.09, 1.06, 0.93, 1.47, 0.62, 1.26, 0.51, 0.94, 0.85
), X5 = c(1.19, 1.06, 0.83, 1.06, 2.94, 1.36, 1.66, 0.68, 1.16,
0.77, 1.37, 1.11), X6 = c(1, 1.32, 1.08, 0.93, 1.36, 2.94, 1.56,
0.9, 1.23, 0.78, 1.65, 1.31), X7 = c(1.45, 1.31, 1.01, 1.47,
1.66, 1.56, 3.11, 1.03, 1.7, 0.81, 1.63, 1.44), Y1 = c(0.68,
0.56, 0.65, 0.62, 0.68, 0.9, 1.03, 1.71, 0.99, 0.65, 0.86, 0.72
), Y2 = c(0.98, 1, 0.78, 1.26, 1.16, 1.23, 1.7, 0.99, 3.07, 0.61,
1.43, 1.28), Y3 = c(0.57, 0.79, 0.66, 0.51, 0.77, 0.78, 0.81,
0.65, 0.61, 2.83, 1.04, 0.84), Y4 = c(1.07, 1.13, 0.93, 0.94,
1.37, 1.65, 1.63, 0.86, 1.43, 1.04, 2.83, 1.6), Y5 = c(0.91,
1.38, 0.77, 0.85, 1.11, 1.31, 1.44, 0.72, 1.28, 0.84, 1.6, 4.01
)), class = "data.frame", row.names = c(NA, -12L))
Now the analysis runs fine:
library("CCA")
str(CETNEW.T)
# 'data.frame': 12 obs. of 12 variables:
# $ X1: num 2.72 1.2 0.82 0.92 1.19 1 1.45 0.68 0.98 0.57 ...
# $ X2: num 1.2 3.78 0.7 1.04 1.06 1.32 1.31 0.56 1 0.79 ...
# $ X3: num 0.82 0.7 1.7 0.59 0.83 1.08 1.01 0.65 0.78 0.66 ...
# $ X4: num 0.92 1.04 0.59 3.09 1.06 0.93 1.47 0.62 1.26 0.51 ...
# $ X5: num 1.19 1.06 0.83 1.06 2.94 1.36 1.66 0.68 1.16 0.77 ...
# $ X6: num 1 1.32 1.08 0.93 1.36 2.94 1.56 0.9 1.23 0.78 ...
# $ X7: num 1.45 1.31 1.01 1.47 1.66 1.56 3.11 1.03 1.7 0.81 ...
# $ Y1: num 0.68 0.56 0.65 0.62 0.68 0.9 1.03 1.71 0.99 0.65 ...
# $ Y2: num 0.98 1 0.78 1.26 1.16 1.23 1.7 0.99 3.07 0.61 ...
# $ Y3: num 0.57 0.79 0.66 0.51 0.77 0.78 0.81 0.65 0.61 2.83 ...
# $ Y4: num 1.07 1.13 0.93 0.94 1.37 1.65 1.63 0.86 1.43 1.04 ...
# $ Y5: num 0.91 1.38 0.77 0.85 1.11 1.31 1.44 0.72 1.28 0.84 ...
X <- CETNEW.T[, 1:7]
Y <- CETNEW.T[, 8:12]
ccCETNEW <- cc(X, Y)
ccCETNEW is a list with five components containing the results.
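As a sketch of what to inspect next (assuming the components usually returned by CCA::cc()):

ccCETNEW$cor      # the canonical correlations
ccCETNEW$xcoef    # canonical coefficients for the X variables
ccCETNEW$ycoef    # canonical coefficients for the Y variables
plt.cc(ccCETNEW)  # CCA's built-in plot of variables and units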

Problems converting NA to numeric in a data frame in R

I have a data frame with both numeric values and characters. The NAs are not stored as R's NA but as characters. How can I convert the character NAs to the numeric value 0? I still want the columns to show 0, because I do not want to remove them from my data frame.
I have tried
df[is.na(df)] <- 0
but it only turns them into the character "0", not numeric values.
df <- as.numeric(as.character(df))
gives the warning message:
NAs introduced by coercion
Are there other solutions available? Thank you.
Here is a small reproducible example:
structure(list(DNB = c(2.05, 2.05, 2.06, 2.32, 2.32, 2.32), `NORSK HYDRO` =
c(2.59,
2.59, 2.65, 2.81, 2.63, 2.63), ORKLA = c(2.29, 2.29, 2.18, 2.31,
2.25, 2.25), STOREBRAND = c(2.28, 2.28, 2.56, 2.88, 2.94, 2.94
), ATEA = c(2.25, 2.25, 2, 2, 2, 2), `SCHIBSTED A` = c(3.23,
3.23, 3.08, 2.92, 2.92, 2.92), BONHEUR = c(2, 2, 2, 2, 2, 2),
EKORNES = c(2.25, 2.25, 2.25, 2.25, 2.25, 2.25), `KONGSBERG GRUPPEN` =
c(2.8,
2.8, 2.5, 2.5, 2.5, 2.5), `TOMRA SYSTEMS` = c(2.43, 2.43,
2.29, 2.29, 2.29, 2.29), VEIDEKKE = c(2.33, 2.33, 2.5, 2.5,
2.33, 2.33), `ARENDALS FOSSEKOMPANI` = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
), `OLAV THON EIEP.` = c(3, 3, 2.8, 2.8, 2.8, 2.8), `PETROLEUM GEO SERVICES` = c(3.13,
3.13, 2.86, 2.63, 2.63, 2.63), `SPAREBANK 1 SR BANK` = c(3,
3, 3, 3, 3, 3), `STOLT-NIELSEN` = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), `ODFJELL 'A'` = c(2.45, 2.45,
2.4, 2.6, 2.4, 2.4), `SPAREBANK 1 NORD-NORGE` = c(3, 3, 3,
3, 3, 3), `SPAREBANK 1 SMN` = c(3, 3, 3, 3, 3, 3), `WILHS.WILHELMSEN HDG.'A'` = c(2.67,
2.67, 2.78, 2.67, 2.67, 2.67), `NORDEA BANK (~NK)` = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), `ATLAS COPCO 'A' (~NK)` = c(3.08,
3.08, 3.1, 2.95, 2.95, 2.95), `VOLVO 'B' (~NK)` = c(3.13,
3.13, 3.17, 2.79, 2.59, 2.59), `SANDVIK (~NK)` = c(3, 3,
2.75, 3.04, 3.09, 3.09), `SWEDBANK 'A' (~NK)` = c(2.29, 2.29,
2.21, 2.05, 2.1, 2.1), `ERICSSON 'B' (~NK)` = c(2.33, 2.33,
2.38, 2.52, 2.44, 2.44), `SVENSKA HANDBKN.'A' (~NK)` = c(2.32,
2.32, 2.33, 2.55, 2.55, 2.55), `HENNES & MAURITZ 'B' (~NK)` = c(3.35,
3.35, 3.42, 3.17, 3.06, 3.06), `SEB 'A' (~NK)` = c(2.9, 2.9,
2.9, 3, 3.09, 3.09), `INVESTOR 'B' (~NK)` = c(2.47, 2.47,
2.38, 2.69, 2.62, 2.62), `SWEDISH MATCH (~NK)` = c(2.08,
2.08, 1.83, 1.69, 1.69, 1.69), `ELECTROLUX 'B' (~NK)` = c(3.38,
3.38, 3.23, 3.13, 3.13, 3.13), `SKANSKA 'B' (~NK)` = c(2.5,
2.5, 2.43, 2.85, 2.86, 2.86), `SCA 'B' (~NK)` = c(2.96, 2.96,
2.87, 2.64, 2.55, 2.55), `SECURITAS 'B' (~NK)` = c(3.64,
3.64, 3.78, 4, 4, 4), `HOLMEN 'B' (~NK)` = c(3.16, 3.16,
3.26, 3.05, 3.24, 3.24), `SSAB 'A' (~NK)` = c(2.33, 2.33,
2.29, 2.41, 2.41, 2.41), `ERICSSON 'A' (~NK)` = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_), `INVESTOR 'A' (~NK)` = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
), `VOLVO 'A' (~NK)` = c(NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_), `NOVO NORDISK 'B' (~NK)` = c(2.52,
2.52, 2.55, 2.64, 2.55, 2.55), `DANSKE BANK (~NK)` = c(2.12,
2.12, 2.38, 2.53, 2.58, 2.58), `COLOPLAST 'B' (~NK)` = c(3.8,
3.8, 4.13, 4.13, 4.13, 4.13), `CARLSBERG 'B' (~NK)` = c(3.11,
3.11, 3.06, 3.24, 3.24, 3.24), `A P MOLLER - MAERSK 'B' (~NK)` = c(2.89,
2.89, 2.75, 2.63, 2.75, 2.75), `TDC (~NK)` = c(2.93, 2.93,
2.96, 2.96, 3.04, 3.04), `TOPDANMARK (~NK)` = c(2.78, 2.78,
2.56, 2.8, 2.8, 2.8), `WILLIAM DEMANT HLDG. (~NK)` = c(4,
4, 3.78, 4, 3.78, 3.78), `JYSKE BANK (~NK)` = c(1.5, 1.5,
1.5, 1.5, 1.5, 1.5), `KOBENHAVNS LUFTHAVNE (~NK)` = c(2.56,
2.56, 2.47, 2.75, 2.56, 2.56), `NKT (~NK)` = c(2.25, 2.25,
2.25, 2.25, 2.25, 2.25), `ROCKWOOL 'B' (~NK)` = c(3.25, 3.25,
3, 3, 3, 3), `SYDBANK (~NK)` = c(3.6, 3.6, 3.2, 4, 4, 4),
`FLSMIDTH & CO.'B' (~NK)` = c(2.6, 2.6, 2.4, 2.4, 2.4, 2.4
), `GN STORE NORD (~NK)` = c(3, 3, 2.78, 2.89, 3.11, 3.11
), `ALK-ABELLO (~NK)` = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_), `BANG & OLUFSEN 'B' (~NK)` = c(4, 4,
3.67, 3.22, 3.22, 3.22), `SANTA FE GROUP (~NK)` = c(3.5,
3.5, 3.4, 3.22, 3.44, 3.44), `CARLSBERG 'A' (~NK)` = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), `ROCKWOOL 'A' (~NK)` = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), `NOKIA (~NK)` = c(1.89,
1.89, 2.04, 1.86, 1.81, 1.81), `SAMPO 'A' (~NK)` = c(2.08,
2.08, 2, 2.36, 2.36, 2.36), `KONE 'B' (~NK)` = c(3.71, 3.71,
3.77, 3.67, 3.64, 3.64), `UPM-KYMMENE (~NK)` = c(2.43, 2.43,
2.45, 2.09, 2.04, 2.04), `WARTSILA (~NK)` = c(2.13, 2.13,
2.07, 2.07, 2.07, 2.07), `METSO (~NK)` = c(2.41, 2.41, 2.41,
2.47, 2.47, 2.47), `STORA ENSO 'R' (~NK)` = c(2.76, 2.76,
2.95, 2.74, 2.57, 2.57), `HUHTAMAKI (~NK)` = c(2.33, 2.33,
2.13, 2.25, 2.25, 2.25), `FINNAIR (~NK)` = c(3, 3, 3, 2.92,
2.92, 2.92), `KEMIRA (~NK)` = c(2.4, 2.4, 2.4, 2.67, 2.8,
2.8), `UPONOR (~NK)` = c(2, 2, 2, 1.8, 1.8, 1.8), `KESKO 'B' (~NK)` = c(2.45,
2.45, 3.09, 2.58, 2.67, 2.67), `ORION 'B' (~NK)` = c(2.57,
2.57, 2.57, 2.63, 2.63, 2.63), `OUTOKUMPU 'A' (~NK)` = c(3.31,
3.31, 3.31, 3, 2.63, 2.63), `RAISIO (~NK)` = c(2.91, 2.91,
3.09, 3.08, 3, 3), `TIETO OYJ (~NK)` = c(2, 2, 2.11, 2.4,
2.4, 2.4), `METSA BOARD 'B' (~NK)` = c(3.26, 3.26, 3.32,
3.14, 2.84, 2.84), `ORION 'A' (~NK)` = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
), `STOCKMANN 'A' (~NK)` = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
), `STORA ENSO 'A' (~NK)` = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
)), .Names = c("DNB", "NORSK HYDRO", "ORKLA", "STOREBRAND",
"ATEA", "SCHIBSTED A", "BONHEUR", "EKORNES", "KONGSBERG GRUPPEN",
"TOMRA SYSTEMS", "VEIDEKKE", "ARENDALS FOSSEKOMPANI", "OLAV THON EIEP.",
"PETROLEUM GEO SERVICES", "SPAREBANK 1 SR BANK", "STOLT-NIELSEN",
"ODFJELL 'A'", "SPAREBANK 1 NORD-NORGE", "SPAREBANK 1 SMN",
"WILHS.WILHELMSEN HDG.'A'",
"NORDEA BANK (~NK)", "ATLAS COPCO 'A' (~NK)", "VOLVO 'B' (~NK)",
"SANDVIK (~NK)", "SWEDBANK 'A' (~NK)", "ERICSSON 'B' (~NK)",
"SVENSKA HANDBKN.'A' (~NK)", "HENNES & MAURITZ 'B' (~NK)", "SEB 'A' (~NK)",
"INVESTOR 'B' (~NK)", "SWEDISH MATCH (~NK)", "ELECTROLUX 'B' (~NK)",
"SKANSKA 'B' (~NK)", "SCA 'B' (~NK)", "SECURITAS 'B' (~NK)",
"HOLMEN 'B' (~NK)", "SSAB 'A' (~NK)", "ERICSSON 'A' (~NK)", "INVESTOR 'A'
(~NK)",
"VOLVO 'A' (~NK)", "NOVO NORDISK 'B' (~NK)", "DANSKE BANK (~NK)",
"COLOPLAST 'B' (~NK)", "CARLSBERG 'B' (~NK)", "A P MOLLER - MAERSK 'B'
(~NK)",
"TDC (~NK)", "TOPDANMARK (~NK)", "WILLIAM DEMANT HLDG. (~NK)",
"JYSKE BANK (~NK)", "KOBENHAVNS LUFTHAVNE (~NK)", "NKT (~NK)",
"ROCKWOOL 'B' (~NK)", "SYDBANK (~NK)", "FLSMIDTH & CO.'B' (~NK)",
"GN STORE NORD (~NK)", "ALK-ABELLO (~NK)", "BANG & OLUFSEN 'B' (~NK)",
"SANTA FE GROUP (~NK)", "CARLSBERG 'A' (~NK)", "ROCKWOOL 'A' (~NK)",
"NOKIA (~NK)", "SAMPO 'A' (~NK)", "KONE 'B' (~NK)", "UPM-KYMMENE (~NK)",
"WARTSILA (~NK)", "METSO (~NK)", "STORA ENSO 'R' (~NK)", "HUHTAMAKI (~NK)",
"FINNAIR (~NK)", "KEMIRA (~NK)", "UPONOR (~NK)", "KESKO 'B' (~NK)",
"ORION 'B' (~NK)", "OUTOKUMPU 'A' (~NK)", "RAISIO (~NK)", "TIETO OYJ (~NK)",
"METSA BOARD 'B' (~NK)", "ORION 'A' (~NK)", "STOCKMANN 'A' (~NK)",
"STORA ENSO 'A' (~NK)"), row.names = c(NA, 6L), class = "data.frame")
We can loop over the columns of the dataset, replace the NAs with 0, and convert each column to numeric (since some columns are character):
df[] <- lapply(df, function(x) as.numeric(replace(x, is.na(x), 0)))
The OP's method of replacing the NAs with 0 first should also work, but the character columns remain character unless we convert them:
df[is.na(df)] <- 0
df[] <- lapply(df, as.numeric)
Here we don't have any factor columns, so as.character is not needed. Note that as.character/as.numeric are applied to vectors/columns, not to the entire dataset.
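If the data did contain factor columns, the as.character step would matter; a minimal sketch with a hypothetical factor f:

f <- factor(c("2.5", "3.1", NA))
as.numeric(f)                 # 1 2 NA -- the underlying level codes, not the values
as.numeric(as.character(f))   # 2.5 3.1 NA -- the actual values, ready for replace()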

How to compute/plot efficient frontiers per time period in one graph in R?

Currently we compute and sort stock data (X1 to X10). Historical data is stored in Excel and R for the periods 1950-1980, 1980-1999, and 1950-1999.
The dataset:
date X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 1950-01-01 5.92 6.35 4.61 4.08 5.47 3.90 2.35 1.49 2.27 0.82
2 1950-02-01 2.43 2.16 2.10 1.58 -0.05 1.14 1.51 1.52 2.02 1.12
3 1950-03-01 -0.81 0.21 -1.67 -0.02 -0.79 0.18 -0.22 1.03 0.12 1.75
4 1950-04-01 5.68 6.45 5.41 5.94 6.10 5.87 3.82 3.34 3.44 3.97
5 1950-05-01 3.84 1.60 1.64 3.33 2.54 2.12 4.46 2.83 3.82 4.75
6 1950-06-01 -9.88 -10.56 -8.02 -7.86 -7.27 -7.44 -7.13 -7.76 -6.32 -5.04
7 1950-07-01 9.09 8.76 7.31 5.88 3.84 4.61 3.09 3.07 1.41 0.42
598 1999-10-01 -0.95 -1.88 -1.25 -0.52 1.65 0.72 5.41 4.38 5.58 6.59
599 1999-11-01 11.57 9.15 8.17 7.14 6.15 4.95 5.78 4.21 1.55 2.15
600 1999-12-01 12.32 14.97 9.29 11.77 11.09 5.89 11.88 11.26 6.23 5.64
The main question: we would like to compute and plot the efficient frontier for each of these time periods in a single graph, to see how the frontier has evolved. Is there a way to do this in R?
The efficient frontier is the set of optimal portfolios that offers the highest expected return for a defined level of risk or the lowest risk for a given level of expected return.
In modern portfolio theory, the efficient frontier (or portfolio frontier) is an investment portfolio which occupies the 'efficient' parts of the risk-return spectrum. Formally, it is the set of portfolios which satisfy the condition that no other portfolio exists with a higher expected return but with the same standard deviation of return.
So, how would one go about computing this in R?
dput sample data (first 50 rows)
> dput(head(data,50))
structure(list(X__1 = structure(c(-631152000, -628473600, -626054400,
-623376000, -620784000, -618105600, -615513600, -612835200, -610156800,
-607564800, -604886400, -602294400, -599616000, -596937600, -594518400,
-591840000, -589248000, -586569600, -583977600, -581299200, -578620800,
-576028800, -573350400, -570758400, -568080000, -565401600, -562896000,
-560217600, -557625600, -554947200, -552355200, -549676800, -546998400,
-544406400, -541728000, -539136000, -536457600, -533779200, -531360000,
-528681600, -526089600, -523411200, -520819200, -518140800, -515462400,
-512870400, -510192000, -507600000, -504921600, -502243200), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), X__2 = c(5.92, 2.43, -0.81, 5.68,
3.84, -9.88, 9.09, 4.93, 3.99, -0.5, 3.09, 15.77, 8.22, 0.36,
-7.36, 3.84, -2.81, -7.12, 3.57, 6.59, 1.04, -1.41, -1.42, -0.53,
1.86, -3.25, 0.68, -4.4, 0.57, 2.5, -0.36, -0.74, -1.11, -0.58,
3.22, 0.33, 5.01, 2.75, -1.25, -2.13, 1.3, -4.42, 0.25, -5.56,
-4.09, 2.71, 2.01, -3.15, 8.48, -0.16), X__3 = c(6.35, 2.16,
0.21, 6.45, 1.6, -10.56, 8.76, 4.63, 3.52, -1.2, 3.36, 10.98,
8.41, 0.81, -4.01, 3.56, -4.27, -6.11, 4.7, 5.3, 2.73, -3.07,
-0.13, 0.6, 1.1, -2.77, 2.37, -4.5, 1.87, 3.18, 1.51, 0.43, -1.91,
-1.52, 4.91, 1.43, 3.4, 3.03, -2.25, -2, 0.34, -4.75, 2.24, -6.53,
-1.87, 1.97, 1.78, -2.96, 7.38, 0.43), X__4 = c(4.61, 2.1, -1.67,
5.41, 1.64, -8.02, 7.31, 4.56, 5.18, -0.46, 3.52, 10.78, 8.46,
0.28, -4.88, 4.26, -3.25, -6.76, 6.78, 4.99, 3.86, -2.57, 0.59,
0.16, 1.75, -2.04, 2.49, -5.29, 1.76, 2.88, 0.76, 0.67, -1.67,
-1.45, 5.69, 2.95, 3.66, 1.15, -1.58, -2.34, 0.51, -3.82, 0.72,
-6.25, -2.33, 3.1, 2.19, -2.63, 7.3, 1.82), X__5 = c(4.08, 1.58,
-0.02, 5.94, 3.33, -7.86, 5.88, 4.68, 5.99, 0.75, 2.68, 9.29,
8, 1.08, -3.13, 4.21, -3.35, -5.01, 5.77, 4.85, 2.73, -3.44,
0.27, 1.56, 1.62, -2.35, 2.93, -4.62, 2.36, 2.56, 0.86, 0.16,
-1.8, -2.04, 5.12, 2.72, 3.21, 1.21, -2.17, -1.84, 0.32, -3.63,
1.47, -5.16, -0.65, 3.33, 1.34, -1.36, 6.24, 1.19), X__6 = c(5.47,
-0.05, -0.79, 6.1, 2.54, -7.27, 3.84, 6.29, 4.46, -0.24, 2.42,
6.12, 8.63, 0.88, -3.31, 4.56, -2.14, -5.62, 5.73, 5.36, 2.44,
-1.88, 0.83, 0.65, 1.47, -1.81, 2.31, -4.48, 2.56, 2.69, 0.9,
0.34, -0.62, -1.58, 6.59, 0.86, 3.58, 1.92, -1.85, -2.79, 0.7,
-3.4, 1.26, -5.26, -1.18, 4.26, 1.35, -0.97, 6.66, 1.77), X__7 = c(3.9,
1.14, 0.18, 5.87, 2.12, -7.44, 4.61, 4.57, 6.14, -0.84, 4.22,
8.37, 7.44, 0.69, -4.26, 4.13, -2.24, -6.75, 5.81, 4.35, 1.98,
-2.87, 0.93, 0.61, 1.27, -2.18, 2.97, -4.09, 2.27, 2.96, 1.16,
-0.38, -2.37, -0.71, 5.53, 2.45, 1.3, 0.31, -0.47, -2.03, 0.14,
-3.26, 1.79, -5.5, -1.47, 4.18, 1.96, -1.35, 7.06, 1.69), X__8 = c(2.35,
1.51, -0.22, 3.82, 4.46, -7.13, 3.09, 5.01, 5.84, -1.05, 3.81,
7.54, 6.46, 0.71, -3.56, 4.42, -1.87, -4.52, 7.3, 3.66, 2.11,
-2.92, 2.25, 2.17, 1.32, -1.71, 3.17, -4.63, 2.59, 3.89, 0.49,
0.21, -1.71, -1.18, 4.95, 3.21, 1.41, 0.89, -1.02, -2.89, 0.59,
-2.67, 1.47, -4.62, -0.69, 4.07, 2.83, -1.44, 6.11, 1.58), X__9 = c(1.49,
1.52, 1.03, 3.34, 2.83, -7.76, 3.07, 3.72, 6.21, -1.66, 3.46,
6.14, 7.17, 2.13, -3.19, 4.59, -2.65, -3.5, 7.43, 3.5, 2.41,
-2.73, 1.35, 1.97, 1.72, -1.8, 4.06, -5.35, 2.57, 3.14, 1.89,
-0.86, -1.73, -0.95, 6.07, 1.73, 1.09, 0.37, -1.34, -2.48, 0.31,
-3.2, 1.34, -4.99, -0.18, 4.35, 3.03, 0.09, 5.65, 2.39), X__10 = c(2.27,
2.02, 0.12, 3.44, 3.82, -6.32, 1.41, 4.54, 5.55, -0.97, 3.8,
5.69, 5.65, 1.78, -2.6, 4.21, -1.29, -2.63, 7.15, 3.52, 1.85,
-2.32, 0.96, 2.74, 1.9, -2.6, 3.83, -4.31, 3.15, 2.76, 0.93,
-0.39, -1.86, -1.57, 7.05, 2.36, -0.33, -0.23, -0.54, -2.6, 0.61,
-2.37, 2.12, -3.76, 0.47, 3.98, 3.03, 0.2, 5.63, 1.26), X__11 = c(0.82,
1.12, 1.75, 3.97, 4.75, -5.04, 0.42, 4.96, 4.32, 0.25, 2.26,
4.71, 5.05, 1.63, -1.53, 5.12, -2.59, -1.92, 6.89, 4.48, -0.09,
-2.49, 0.26, 4.03, 1.37, -2.82, 4.95, -5.1, 3.4, 4.29, 0.89,
-1.06, -2.18, -0.31, 5.76, 3.32, -1.04, -0.63, -1.78, -2.97,
0.55, -1.3, 2.75, -4.47, 0.48, 4.83, 2.85, 0.27, 4.4, 1.93)), .Names = c("date",
"X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8",
"X9", "X10"), row.names = c(NA, 50L), class = c("tbl_df",
"tbl", "data.frame"))
After some correspondence in the comments with @Jonathan, I widened the example data from 3 columns to 12 columns by sampling. The code from the "With short-selling" section of the blog scales well to 10K observations:
# using code at:
# https://www.r-bloggers.com/a-gentle-introduction-to-finance-using-r-efficient-frontier-and-capm-part-1/
# https://datashenanigan.wordpress.com/2016/05/24/a-gentle-introduction-to-finance-using-r-efficient-frontier-and-capm-part-1/
library(data.table)

calcEFParams <- function(rets) {
  retbar <- colMeans(rets, na.rm = TRUE)
  covs <- var(rets, na.rm = TRUE)  # covariance matrix of the returns
  invS <- solve(covs)
  i <- matrix(1, nrow = length(retbar))
  alpha <- t(i) %*% invS %*% i
  beta <- t(i) %*% invS %*% retbar
  gamma <- t(retbar) %*% invS %*% retbar
  delta <- alpha * gamma - beta * beta
  list(alpha = as.numeric(alpha),
       beta = as.numeric(beta),
       gamma = as.numeric(gamma),
       delta = as.numeric(delta))
}
# load data
link <- "https://raw.githubusercontent.com/DavZim/Efficient_Frontier/master/data/mult_assets.csv"
df <- data.table(read.csv(link))
# widen the data by resampling columns and binding them on
df2 <- df[, lapply(.SD, sample)]
df3 <- cbind(df, df2)
df4 <- df3[, lapply(.SD, sample)]
df5 <- cbind(df3, df4)
Loading the microbenchmark package, the performance looks like this:
> library(microbenchmark)
> microbenchmark(calcEFParams(df5), times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
calcEFParams(df5) 2.692514 2.764053 2.795127 2.777547 2.805447 3.024349 10
It seems that David Zimmermann's code is scalable and efficient enough!
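Coming back to the original plotting question: given these parameters, the closed-form frontier is sigma^2(mu) = (alpha*mu^2 - 2*beta*mu + gamma)/delta, so you can draw one curve per period. A minimal sketch (the per-period subsets rets_5080, rets_8099 and rets_5099 are hypothetical stand-ins for your split data):

plotEFs <- function(periods, mus) {
  # compute the minimum-sd curve for each period from the closed-form frontier
  curves <- lapply(periods, function(rets) {
    p <- calcEFParams(as.matrix(rets))
    sqrt((p$alpha * mus^2 - 2 * p$beta * mus + p$gamma) / p$delta)
  })
  plot(NULL, xlim = range(unlist(curves)), ylim = range(mus),
       xlab = "Risk (sd)", ylab = "Expected return")
  for (k in seq_along(curves)) lines(curves[[k]], mus, col = k)
  legend("topleft", names(periods), col = seq_along(periods), lty = 1)
}

# hypothetical per-period subsets of the monthly returns
periods <- list("1950-1980" = rets_5080, "1980-1999" = rets_8099,
                "1950-1999" = rets_5099)
plotEFs(periods, mus = seq(0, 3, length.out = 200))

Each curve gives the minimum attainable standard deviation for every target return mu, so overlaying them shows how the frontier shifted across periods.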

R: finding intersection between two vectors [duplicate]

This question already has answers here:
Why are these numbers not equal?
(6 answers)
Closed 5 years ago.
v1 = c(2, 2.01, 2.02, 2.03, 2.04, 2.05, 2.06, 2.07, 2.08, 2.09, 2.1,
2.11, 2.12, 2.13, 2.14, 2.15, 2.16, 2.17, 2.18, 2.19, 2.2, 2.21,
2.22, 2.23, 2.24, 2.25, 2.26, 2.27, 2.28, 2.29, 2.3, 2.31, 2.32,
2.33, 2.34, 2.35, 2.36, 2.37, 2.38, 2.39, 2.4, 2.41, 2.42, 2.43,
2.44, 2.45, 2.46, 2.47, 2.48, 2.49, 2.5, 2.51, 2.52, 2.53, 2.54,
2.55, 2.56, 2.57, 2.58, 2.59, 2.6, 2.61, 2.62, 2.63, 2.64, 2.65,
2.66, 2.67, 2.68, 2.69, 2.7, 2.71, 2.72, 2.73, 2.74, 2.75, 2.76,
2.77, 2.78, 2.79, 2.8, 2.81, 2.82, 2.83, 2.84, 2.85, 2.86, 2.87,
2.88, 2.89, 2.9, 2.91, 2.92, 2.93, 2.94, 2.95, 2.96, 2.97, 2.98,
2.99)
> intersect(v1, seq(2, 2.99, 0.01))
[1] 2.00 2.01 2.02 2.04 2.05 2.06 2.08 2.09 2.10 2.12 2.13 2.14 2.16 2.17 2.19 2.20 2.21 2.23 2.24 2.25 2.26 2.27
[23] 2.28 2.29 2.30 2.31 2.33 2.34 2.35 2.37 2.38 2.39 2.41 2.42 2.44 2.45 2.46 2.48 2.49 2.50 2.51 2.52 2.53 2.54
[45] 2.55 2.56 2.57 2.58 2.59 2.60 2.62 2.63 2.64 2.66 2.67 2.69 2.70 2.71 2.72 2.73 2.74 2.75 2.76 2.77 2.78 2.79
[67] 2.80 2.81 2.82 2.83 2.84 2.85 2.87 2.88 2.89 2.91 2.92 2.94 2.95 2.96 2.97 2.98 2.99
I have a vector of length 100 called v1. I want to see the intersection of v1 and seq(2, 2.99, 0.01) (it should be v1 itself). But I get a vector that is only 83 elements long, and clearly 2.03, 2.15, etc. are missing from the intersection. How is that possible?
This is a floating-point precision issue in R. See the Floating Point Guide for more information.
You can confirm this is the cause, because rounding first returns what you're looking for:
v1 = c(2, 2.01, 2.02, 2.03, 2.04, 2.05, 2.06, 2.07, 2.08, 2.09, 2.1,
2.11, 2.12, 2.13, 2.14, 2.15, 2.16, 2.17, 2.18, 2.19, 2.2, 2.21,
2.22, 2.23, 2.24, 2.25, 2.26, 2.27, 2.28, 2.29, 2.3, 2.31, 2.32,
2.33, 2.34, 2.35, 2.36, 2.37, 2.38, 2.39, 2.4, 2.41, 2.42, 2.43,
2.44, 2.45, 2.46, 2.47, 2.48, 2.49, 2.5, 2.51, 2.52, 2.53, 2.54,
2.55, 2.56, 2.57, 2.58, 2.59, 2.6, 2.61, 2.62, 2.63, 2.64, 2.65,
2.66, 2.67, 2.68, 2.69, 2.7, 2.71, 2.72, 2.73, 2.74, 2.75, 2.76,
2.77, 2.78, 2.79, 2.8, 2.81, 2.82, 2.83, 2.84, 2.85, 2.86, 2.87,
2.88, 2.89, 2.9, 2.91, 2.92, 2.93, 2.94, 2.95, 2.96, 2.97, 2.98,
2.99)
v2 <- seq(2, 2.99, 0.01)
v1 <- round(v1, 2)  # round to 2 decimal places
v2 <- round(v2, 2)
intersect(v1, v2)   # returns v1
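An alternative sketch that avoids rounding is to match with an explicit tolerance (near_in and tol are illustrative, not base R):

near_in <- function(x, table, tol = 1e-9) {
  x[sapply(x, function(xi) any(abs(xi - table) < tol))]
}
length(near_in(v1, seq(2, 2.99, 0.01)))  # 100: all of v1 matches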

Weird behaviour (bug?) in car::bcPower

Consider the dataset Kort:
structure(list(V1 = c(-0.03, 0.22, -0.11, -0.01, 0.25, 0.29,
-0.74, 0.23, 0.39, -0.04, 0.18, 0.19, 0.4, 0.21, 0.21, -0.01,
-0.05, 0.02, -0.12, 0.37, -0.07, 0.51, 0.39, 0.14, 0.02, 0.73,
-0.25, 0.44, 0.29), V2 = c(35.39, 34.33, 32.74, 34.72, 33.07,
30.9, 29.89, 31.17, 31.62, 33.13, 30.64, 33.31, 33.61, 34.16,
30.06, 30.06, 31.18, 25.57, 30.52, 32.43, 31.54, 29.6, 34.66,
31.74, 27.22, 41, 32.02, 37.96, 29.25), V3 = c(37.24, 36.77,
37.21, 41.16, 40.3, 42.16, 40.77, 39.59, 37, 38.32, 34.6, 38.1,
36.07, 39.2, 36.97, 38.28, 38.72, 46.81, 39.63, 36, 45.33, 38.72,
36.2, 40.94, 37.7, 42.44, 37.92, 39.87, 37.15), V4 = c(-36L,
-18L, -2L, 20L, 37L, 39L, -7L, 31L, -23L, 32L, 73L, 10L, 14L,
18L, 126L, 98L, 13L, 14L, 15L, 37L, 66L, 3L, -50L, 9L, 6L, -20L,
4L, -26L, -2L), V5 = c(12.4, 10.5, 2.8, 9.5, 9.4, 10.7, 7.5,
14.8, 10.9, 13.5, 11.5, 11.8, 13.6, 8.6, 13.6, 13.1, 14.3, 11.3,
16.1, 14.5, 8.4, 15.4, 13.4, 14, 18.8, 17.4, 16.4, 16, 17.7),
V6 = c(27424L, 25597L, 20968L, 24730L, 25423L, 25801L, 23681L,
29527L, 26228L, 28262L, 27363L, 27134L, 27542L, 24647L, 28260L,
27922L, 29054L, 25650L, 30096L, 29103L, 24112L, 30035L, 28771L,
27818L, 32455L, 29722L, 30508L, 29896L, 31961L), V7 = c(68.8,
70.4, 61.6, 73.5, 71.8, 76.5, 72.7, 75.3, 71.7, 75, 72.9,
73.3, 73.7, 69, 72.7, 74.2, 73.4, 71.2, 76.4, 73, 62.5, 76,
73.7, 74.7, 74.3, 74.8, 74.6, 74.4, 74.4), V8 = c(8.1, 6.8,
11, 5.3, 6.3, 4.1, 5.5, 4, 5.9, 4.3, 5.5, 5.4, 4.2, 8.1,
5.2, 4.8, 4.4, 8.2, 3.8, 5.9, 12.9, 4.3, 5.2, 5, 3.6, 3.8,
4.6, 4.3, 4.5), V9 = c(0.38, 0.15, 0.16, 0.08, 0.12, 0.05,
0.07, 0.04, 0.08, 0.07, 0.13, 0.08, 0.08, 0.26, 0.05, 0.14,
0.05, 0.26, 0.03, 0.18, 0.26, 0.04, 0.04, 0.14, 0.05, 0,
0.02, 0.02, 0.1), V10 = c(9.8, 9.9, 19.4, 7, 9.2, 3, 8.5,
1.1, 3, 2.3, 5.1, 5.6, 1, 22.3, 4.4, 6.2, 2.2, 5.3, 1.5,
5, 18.7, 1.5, 3, 8.9, 1.6, 0, 5.1, 2.1, 3.6), V11 = c(6.3,
7.5, 5.5, 10.2, 5, 9.6, 9.3, 4.8, 4.3, 4.6, 4.1, 5.7, 6.4,
4, 7.2, 4.7, 4.2, 4.5, 7.6, 5.3, 6.2, 4.1, 4.9, 4.1, 5.1,
3.3, 5.4, 5, 5.6), V12 = c(153605L, 152867L, 115972L, 140341L,
139245L, 167038L, 143239L, 179712L, 135273L, 167487L, 160738L,
160648L, 154717L, 118800L, 168954L, 148412L, 147637L, 142615L,
210838L, 161840L, 114310L, 182670L, 160293L, 147747L, 192889L,
191077L, 164107L, 202051L, 192945L)), .Names = c("V1", "V2",
"V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "V11", "V12"
), class = "data.frame", row.names = c(NA, -29L))
Where the response is:
Kort$V12
[1] 153605 152867 115972 140341 139245 167038 143239 179712 135273 167487
[11] 160738 160648 154717 118800 168954 148412 147637 142615 210838 161840
[21] 114310 182670 160293 147747 192889 191077 164107 202051 192945
Doing a Box-Cox transform, using car::boxCox
boxCox(V12 ~ ., data = Kort, lambda = seq(-4, 4, 4/10))
yields an optimal parameter of -2. Transforming the response using car::bcPower
TVP <- bcPower(Kort$V12, lambda = -2)
turns TVP into a vector of constants:
TVP
[1] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
[20] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
but the Box-Cox transform should be a continuous map!
I don't think this is a bug; there's simply a limit to how many decimal places are printed. The help file gives the calculation as (U^lambda - 1)/lambda, which for lambda = -2 equals (1 - 1/U^2)/2 and is therefore very close to 1/2 when U is large. You can see that TVP is being calculated correctly with
TVP - 0.5
# [1] -2.119138e-11 -2.139650e-11 -3.717610e-11 ...
or
options(digits=20)
TVP
# [1] 0.49999999997880861802 0.49999999997860350431 0.49999999996282390446 ...
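As a quick sketch of the algebra behind this (not package code): for lambda = -2 the transform is (U^-2 - 1)/(-2) = (1 - 1/U^2)/2, which converges to 0.5 as U grows, and V12 is on the order of 10^5:

U <- c(1e2, 1e5, 1e8)
(U^(-2) - 1) / (-2)
# 0.49995, then values that print as 0.5 at the default 7 significant digits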
