I am rather new to R and could really use the community's help with the following problem. I am trying to solve for the variable r in the equation ((EPS2 + r*DPS1 - EPS1)/r^2) - PRC = 0. Here is my (unsuccessful) attempt at solving the problem using the uniroot function:
EPS2 = df_final$EPS2
DPS1 = df_final$DPS1
EPS1 = df_final$EPS1
PRC = df_final$PRC
f1 = function(r) {
((df_final_test$EPS2 + r * df_final_test$DPS1-df_final_test$EPS1)/r^2)-df_final_test$PRC
}
uniroot(f1,interval = c(1e-8,100000),EPS2, DPS1, EPS1, PRC , extendInt="downX")$root
I then get the following error: Error in f(lower, ...) : unused
arguments (c(" 1.39", " 1.39", ...
I am grateful for any tips and hints regarding this problem, or for suggestions of a different function/package that would be better suited in this case.
I have added a reprex in case that helps:
df <- structure(list(EPS1 = c(6.53, 1.32, 1.39, 1.71, 2.13),
                     DPS1 = c(2.53, 0.63, 0.81, 1.08, 1.33),
                     EPS2 = c(7.57, 1.39, 1.43, 1.85, 2.49),
                     PRC = c(19.01, 38.27, 44.82, 35.27, 47.12)),
                .Names = c("EPS1", "DPS1", "EPS2", "PRC"),
                row.names = c(NA, -5L), class = "data.frame")
I don't think you can use uniroot() when the coefficients are vectors rather than scalars. In this case, a straightforward approach is to solve the equation analytically: setting ((EPS2 + r*DPS1 - EPS1)/r^2) - PRC = 0 and multiplying through by r^2 gives the quadratic PRC*r^2 - DPS1*r + (EPS1 - EPS2) = 0, whose roots by the quadratic formula are
r1 <- (DPS1 + sqrt(DPS1^2-4*PRC*(EPS1-EPS2)))/(2*PRC)
and
r2 <- (DPS1 - sqrt(DPS1^2-4*PRC*(EPS1-EPS2)))/(2*PRC)
where r1 and r2 are the two roots, computed for all rows at once.
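For example, with the reprex data frame df from the question (a minimal sketch; with these inputs the discriminant is positive in every row, so both roots are real):
disc <- with(df, DPS1^2 - 4 * PRC * (EPS1 - EPS2))
r1 <- with(df, (DPS1 + sqrt(disc)) / (2 * PRC))
r2 <- with(df, (DPS1 - sqrt(disc)) / (2 * PRC))
r1  # the positive root agrees with the uniroot() solutions in the answer below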
Disclaimer: I have no experience with uniroot() and have no idea if the following makes sense, but it runs! The idea is basically to call uniroot() for each row of the data frame.
Note that I modified the function f1 slightly so that the additional parameters are passed as arguments to the function instead of being looked up by name in the parent environment. I also use with() to avoid writing df$... for every variable.
library(tidyverse)
library(furrr)
#> Loading required package: future
df <- structure(list(EPS1 = c(6.53, 1.32, 1.39, 1.71, 2.13),
                     DPS1 = c(2.53, 0.63, 0.81, 1.08, 1.33),
                     EPS2 = c(7.57, 1.39, 1.43, 1.85, 2.49),
                     PRC = c(19.01, 38.27, 44.82, 35.27, 47.12)),
                .Names = c("EPS1", "DPS1", "EPS2", "PRC"),
                row.names = c(NA, -5L), class = "data.frame")
df
#> EPS1 DPS1 EPS2 PRC
#> 1 6.53 2.53 7.57 19.01
#> 2 1.32 0.63 1.39 38.27
#> 3 1.39 0.81 1.43 44.82
#> 4 1.71 1.08 1.85 35.27
#> 5 2.13 1.33 2.49 47.12
f1 = function(r, EPS2, DPS1, EPS1, PRC) {
(( EPS2 + r * DPS1 - EPS1)/r^2) - PRC
}
# try for first row
with(df,
uniroot(f1,
EPS2=EPS2[1], DPS1=DPS1[1], EPS1=EPS1[1], PRC=PRC[1],
interval = c(1e-8,100000),
extendInt="downX")$root)
#> [1] 0.3097291
# it runs!
# loop over each row
vec_sols <- rep(NA, nrow(df))
for (i in seq_len(nrow(df))) {
sol <- with(df, uniroot(f1,
EPS2=EPS2[i], DPS1=DPS1[i], EPS1=EPS1[i], PRC=PRC[i],
interval = c(1e-8,100000),
extendInt="downX")$root)
vec_sols[i] <- sol
}
vec_sols
#> [1] 0.30972906 0.05177443 0.04022946 0.08015686 0.10265226
# Alternatively, you can use furrr's future_map_dbl to use multiple cores.
# the following will basically do the same as the above loop.
# here with 4 cores.
plan(multisession, workers = 4)
vec_sols <- 1:nrow(df) %>% furrr::future_map_dbl(
.f = ~with(df,
uniroot(f1,
EPS2=EPS2[.x], DPS1=DPS1[.x], EPS1=EPS1[.x], PRC=PRC[.x],
interval = c(1e-8,100000),
extendInt="downX")$root
))
vec_sols
#> [1] 0.30972906 0.05177443 0.04022946 0.08015686 0.10265226
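For comparison, the same per-row solve can also be written without an explicit loop or parallelization using base R's mapply (a minimal sketch, assuming the same f1 and df as above):
vec_sols <- with(df, mapply(
  function(e2, d1, e1, p)
    uniroot(f1, EPS2 = e2, DPS1 = d1, EPS1 = e1, PRC = p,
            interval = c(1e-8, 100000), extendInt = "downX")$root,
  EPS2, DPS1, EPS1, PRC))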
# then apply the solutions back to the dataframe (each row to each solution)
df %>% mutate(
root = vec_sols
)
#> EPS1 DPS1 EPS2 PRC root
#> 1 6.53 2.53 7.57 19.01 0.30972906
#> 2 1.32 0.63 1.39 38.27 0.05177443
#> 3 1.39 0.81 1.43 44.82 0.04022946
#> 4 1.71 1.08 1.85 35.27 0.08015686
#> 5 2.13 1.33 2.49 47.12 0.10265226
Created on 2021-06-20 by the reprex package (v2.0.0)
How can I set a parameter to a discrete set of, say, 3 float values?
For example, I want to search parameter X over 0.99, 0.98 and 0.97.
p_dbl() has lower and upper arguments, but no way to specify explicit values.
I want something like this (which doesn't work, of course):
p_dbl(c(0.99, 0.98, 0.97))
I am assuming you want to do this for tuning purposes. In that case, you can use a p_fct() whose levels are the characters ("0.99", ...) and transform them back to numeric values, as shown below.
Note that the second solution given here is simply a shortcut for the first approach, where the transformation is defined explicitly: the paradox package creates the transformation for you, just as it is written out in the first of the two solutions.
library(paradox)
library(data.table)
search_space = ps(
a = p_fct(levels = c("0.99", "0.98", "0.97"), trafo = function(x, param_set) {
switch(x,
"0.99" = 0.99,
"0.98" = 0.98,
"0.97" = 0.97
)
})
)
design = rbindlist(generate_design_grid(search_space, 3)$transpose(), fill = TRUE)
design
#> a
#> 1: 0.99
#> 2: 0.98
#> 3: 0.97
class(design[[1]])
#> [1] "numeric"
# the same can be achieved as follows:
search_space = ps(
a = p_fct(levels = c(0.99, 0.98, 0.97))
)
design = rbindlist(generate_design_grid(search_space, 3)$transpose(), fill = TRUE)
design
#> a
#> 1: 0.99
#> 2: 0.98
#> 3: 0.97
class(design[[1]])
#> [1] "numeric"
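If the list of values were longer, writing out the switch() would get tedious; a sketch of a more generic trafo (same idea as above, assuming the levels are numeric strings):
search_space = ps(
  a = p_fct(levels = c("0.99", "0.98", "0.97"),
            trafo = function(x, param_set) as.numeric(x))
)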
Created on 2023-02-15 with reprex v2.0.2
I have a dataset in this format:
structure(list(id = 1:4, var1_before = c(-0.16, -0.31, -0.26,
-0.77), var2_before = c(-0.7, -1.06, -0.51, -0.81), var3_before = c(2.47,
2.97, 2.91, 3.01), var4_before = c(-1.08, -1.22, -0.92, -1.16
), var5_before = c(0.54, 0.4, 0.46, 0.79), var1_after = c(-0.43,
-0.18, -0.59, 0.64), var2_after = c(-0.69, -0.38, -1.19, -0.77
), var3_after = c(2.97, 3.15, 3.35, 1.52), var4_after = c(-1.11,
-0.99, -1.26, -0.39), var5_after = c(1.22, 0.41, 1.01, 0.24)), class = "data.frame", row.names = c(NA,
-4L))
Every id is unique.
I would like to make two clusters:
First cluster for variables: var1_before, var2_before, var3_before, var4_before, var5_before
Second cluster for variables: var1_after, var2_after, var3_after, var4_after, var5_after
I used two-step clustering in SPSS for this.
How is it possible to make it in R?
This question is quite complex; here is how I'd approach the problem, hoping to help and maybe to start a discussion about it.
Note:
this is how I think the problem could be approached;
I do not know two-step clustering, so I use k-means;
it's based on your data, but you can easily generalize it: I've made it specific to your data because that's simpler to explain.
So, you create the first clustering with the before variables; then the values of the variables change (the after variables), and you want to see whether the ids are still in the same cluster.
This leads me to think that you only need the first clustering (on the before variables) and then to check whether the ids have changed: there is no need for a second clustering, only to see whether each id has moved from one cluster to the other.
# first, you make your model of clustering, I'm using a simple kmeans
set.seed(1234)
model <- kmeans(df[,2:6],2)
# you put the clusters in the dataset
df$before_cluster <- model$cluster
Now the idea is to calculate the Euclidean distance from each id's new values (the after variables) to the centroids calculated on the before variables:
# for the first cluster
cl1 <- list()
for (i in 1:nrow(df)) {
cl1[[i]] <- dist(rbind(df[i,7:11], model$centers[1,] ))
}
cl1 <- do.call(rbind, cl1)
colnames(cl1) <- 'first'
# for the second cluster
cl2 <- list()
for (i in 1:nrow(df)) {
cl2[[i]] <- dist(rbind(df[i,7:11], model$centers[2,] ))
}
cl2 <- do.call(rbind, cl2)
colnames(cl2) <- 'second'
# put them together
df <- cbind(df, cl1, cl2)
Now the last part: you can decide whether an id has changed cluster by taking the smallest distance to the two centroids (the nearest centroid defines the "new" cluster).
df$new_cl <- ifelse(df$first < df$second, 1,2)
df
  id var1_before var2_before var3_before var4_before var5_before var1_after var2_after var3_after var4_after var5_after before_cluster     first    second new_cl
1  1       -0.16       -0.70        2.47       -1.08        0.54      -0.43      -0.69       2.97      -1.11       1.22              2 0.6852372 0.8151840      1
2  2       -0.31       -1.06        2.97       -1.22        0.40      -0.18      -0.38       3.15      -0.99       0.41              1 0.7331098 0.5208887      2
3  3       -0.26       -0.51        2.91       -0.92        0.46      -0.59      -1.19       3.35      -1.26       1.01              2 0.6117598 1.1180004      1
4  4       -0.77       -0.81        3.01       -1.16        0.79       0.64      -0.77       1.52      -0.39       0.24              1 2.0848381 1.5994765      2
It seems they have all changed cluster.
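As an aside, the two distance loops above can be vectorized for larger data (a minimal sketch, assuming the same df and model as above):
# distance of each row's "after" profile (columns 7:11) to each before-centroid
after <- as.matrix(df[, 7:11])
dists <- sapply(1:nrow(model$centers), function(k)
  sqrt(rowSums(sweep(after, 2, model$centers[k, ])^2)))
# the new cluster is the centroid with the smallest distance
df$new_cl <- max.col(-dists)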
I have 48 dataframes and I wish to calculate a linear regression for each of the stocks in each of the dataframes (the CAPM). Each dataframe contains the same roughly 470 stocks plus the S&P 500, and holds 36 months' worth of data. Originally I had one large dataframe, but I have successfully split the data into the 48 dataframes (this might not have been the smartest move, but it is how I solved the problem).
When I run the following code, it works fine. Noting that I have hard coded in Block 1.
beta_results <- lapply(symbols, function(x) {
temp <- as.data.frame(Block1)
input <- as.formula(paste("temp$",x, "~ temp$SP500" ))
capm <- lm(input)
coefficients(capm)
})
Now rather than change the coding for each of the 48 blocks (i.e. Block1 to Block2 etc.), I attempted the following, which in hindsight is complete rubbish. What I need is a way to increment i from 1 to 48. I had tried to put all the dataframes in a list, but given the way I have the regression working, I would be processing two lists, and that was beyond me.
beta_results <- lapply(seq_along(symbols), function(i,x) {
temp <- as.data.frame(paste0("Block",i))
input <- as.formula(paste("temp$",x, "~ temp$SP500" ))
capm <- lm(input)
coefficients(capm)
})
Code for some example dataframes etc are:
symbols <- c("A", "AAPL", "BRKB")
Block1 to BlockN would take the form of
A AAPL BRKB SP500
2016-04-29 -0.139 0.111 0.122 0.150
2016-05-31 0.071 0.095 0.330 0.200
2016-06-30 -0.042 -0.009 0.230 0.150
2016-07-29 0.090 0.060 0.200 0.100
2016-08-31 0.023 0.013 0.005 0.050
2016-09-30 0.065 0.088 0.002 0.100
Consider a nested lapply where the outer loop iterates through a list of dataframes and the inner loop through each symbol. The result is a 48-member list, each member containing 470 sets of beta coefficients.
Also, as an aside, it is preferable to keep many similarly structured objects in a list, especially when running the same operations on them, to avoid flooding your global environment (you manage 1 list instead of 48 dataframes):
# LIST OF DATA FRAMES FROM ALL GLOBAL VARIABLES CONTAINING "Block"
dfList <- mget(ls(pattern="Block"))
# NESTED LAPPLY
results_list <- lapply(dfList, function(df) {
  lapply(symbols, function(x) {
    input <- reformulate("SP500", response = x)
    capm <- lm(input, data = df)
    coefficients(capm)
  })
})
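Since dfList is built with mget(), results_list keeps the block names; a brief usage sketch (assuming objects named Block1, Block2, ... exist in the global environment):
# coefficients for the first symbol ("A") in Block1
results_list[["Block1"]][[1]]
# optionally name the inner lists by symbol for easier lookup
results_list <- lapply(results_list, setNames, symbols)
results_list[["Block1"]][["AAPL"]]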
@Parfait's answer is the correct one for the OP's question of using lapply to process a list of data frames.
The following example shows how data.table can be used to get the coefficients of lm(stock~SP500) for each stock (using the Block1 example data):
library(data.table)
dt <- structure(list(date = c("2016-04-29", "2016-05-31", "2016-06-30",
"2016-07-29", "2016-08-31", "2016-09-30"), A = c(-0.139, 0.071,
-0.042, 0.09, 0.023, 0.065), AAPL = c(0.111, 0.095, -0.009, 0.06,
0.013, 0.088), BRKB = c(0.122, 0.33, 0.23, 0.2, 0.005, 0.002),
SP500 = c(0.15, 0.2, 0.15, 0.1, 0.05, 0.1)), .Names = c("date",
"A", "AAPL", "BRKB", "SP500"), row.names = c(NA, -6L), class = "data.frame")
setDT(dt)
# Convert to long format for easier lm
dt_melt <- melt(dt, id.vars = c("date", "SP500"))
# Extract coefficients by doing lm for each unique variable (i.e. stock)
dt_lm <- dt_melt[, as.list(coefficients(lm(value~SP500))), by = variable]
# Fix column names
setnames(dt_lm, c("stock", "intercept", "slope"))
> dt_lm
stock intercept slope
1: A 0.05496970 -0.3490909
2: AAPL 0.01421212 0.3636364
3: BRKB -0.10751515 2.0454545
I have a data frame of 3 points in space represented by their longitude and latitute:
myData <- structure(list(lng = c(-37.06852042, -37.07473406, -37.07683313
), lat = c(-11.01471746, -11.02468103, -11.02806217)), .Names = c("lng",
"lat"), row.names = c(NA, 3L), class = "data.frame")
Next, I use the geosphere package to get a distance matrix (in meters, which I convert to km) for the points:
> m <- round(distm(myData)/1000,2)
> rownames(m) <- c("A", "B", "C")
> colnames(m) <- c("A", "B", "C")
> m
A B C
A 0.00 1.30 1.74
B 1.30 0.00 0.44
C 1.74 0.44 0.00
Given this is a distance matrix and there are 6 ways of visiting A, B and C (like A -> B -> C, C -> A -> B, and so on), I would like to extract some information from it, like the minimum, the median, and the maximum total distance.
To illustrate it, I calculated all the possible ways of my example manually:
ways <- c(abc <- 1.3 + 0.44,
acb <- 1.74 + 0.44,
bac <- 1.3 + 1.74,
bca <- 0.44 + 1.74,
cab <- 1.74 + 1.3,
cba <- 0.44 + 1.3)
> min(ways)
[1] 1.74
> median(ways)
[1] 2.18
> max(ways)
[1] 3.04
How do I automate this task, given that I'll be working with more than 10 locations and this problem has factorial complexity?
I wrote a package called trotter that maps integers to different arrangement types (permutations, combinations and others). For this problem, you are interested in permutations of the locations. One of the objects in the package is the permutation pseudo-vector, created using the function ppv().
First install "trotter":
install.packages("trotter")
Then an automated version of your task might look something like:
library(geosphere)
myData <- data.frame(
lng = c(-37.06852042, -37.07473406, -37.07683313),
lat = c(-11.01471746, -11.02468103, -11.02806217)
)
m <- round(distm(myData) / 1000, 2)
locations <- c("A", "B", "C")
rownames(m) <- colnames(m) <- locations
library(trotter)
perms <- ppv(k = length(locations), items = locations)
ways <- c()
for (i in 1:length(perms)) {
perm <- perms[i]
route <- paste(perm, collapse = "")
ways[[route]] <- sum(
sapply(
1:(length(perm) - 1),
function(i) m[perm[i], perm[i + 1]]
)
)
}
Back in the R console:
> ways
ABC ACB CAB CBA BCA BAC
1.74 2.18 3.04 1.74 2.18 3.04
> # What is the minimum route length?
> min(ways)
[1] 1.74
> # Which route (index) is this?
> which.min(ways)
ABC
1
Just remember, like you said, you're dealing with factorial complexity and you might end up waiting a while running this brute force search with more than a few locations...
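If you'd rather not add a dependency, the same brute-force search can be done in base R with a small recursive permutation helper (a minimal sketch, assuming the same m and locations as above; perm_all is a hypothetical helper name):
# generate all permutations of a character vector recursively
perm_all <- function(v) {
  if (length(v) == 1) return(list(v))
  do.call(c, lapply(seq_along(v), function(i)
    lapply(perm_all(v[-i]), function(p) c(v[i], p))))
}
routes <- perm_all(locations)
# total length of each route: sum of consecutive-leg distances,
# using matrix indexing by row/column names
ways <- sapply(routes, function(p) sum(m[cbind(p[-length(p)], p[-1])]))
names(ways) <- sapply(routes, paste, collapse = "")
min(ways); median(ways); max(ways)
The same factorial-complexity caveat applies here, of course.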
Here is the test I'm interested in:
http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h3.htm
How can I adapt this code into a function that accepts a vector of numeric values and returns a logical vector specifying which data points to remove?
I have attempted to do this below, but I'm getting stuck because when I sort the vector to return, it doesn't line up with the input vector data.
# input data
y = c(-0.25, 0.68, 0.94, 1.15, 1.20, 1.26, 1.26,
1.34, 1.38, 1.43, 1.49, 1.49, 1.55, 1.56,
1.58, 1.65, 1.69, 1.70, 1.76, 1.77, 1.81,
1.91, 1.94, 1.96, 1.99, 2.06, 2.09, 2.10,
2.14, 2.15, 2.23, 2.24, 2.26, 2.35, 2.37,
2.40, 2.47, 2.54, 2.62, 2.64, 2.90, 2.92,
2.92, 2.93, 3.21, 3.26, 3.30, 3.59, 3.68,
4.30, 4.64, 5.34, 5.42, 6.01)
## Generate normal probability plot.
qqnorm(y)
removeoutliers = function(dfinputcol) {
y = as.vector(dfinputcol)
## Create function to compute the test statistic.
rval = function(y){
ares = abs(y - mean(y))/sd(y)
df = data.frame(y, ares)
r = max(df$ares)
list(r, df)}
## Define values and vectors.
n = length(y)
alpha = 0.05
lam = c(1:10)
R = c(1:10)
## Compute test statistic until r=10 values have been
## removed from the sample.
for (i in 1:10){
if(i==1){
rt = rval(y)
R[i] = unlist(rt[1])
df = data.frame(rt[2])
newdf = df[df$ares!=max(df$ares),]}
else if(i!=1){
rt = rval(newdf$y)
R[i] = unlist(rt[1])
df = data.frame(rt[2])
newdf = df[df$ares!=max(df$ares),]}
## Compute critical value.
p = 1 - alpha/(2*(n-i+1))
t = qt(p,(n-i-1))
lam[i] = t*(n-i) / sqrt((n-i-1+t**2)*(n-i+1))
}
## Print results.
newdf = data.frame(c(1:10),R,lam)
names(newdf)=c("Outliers","TestStat.", "CriticalVal.")
# determine how many outliers to remove
toremove = max(newdf$Outliers[newdf$TestStat. > newdf$CriticalVal.])
# create vector of same size as input vector
logicalvectorTifshouldremove = logical(length=length(y))
# but how to determine which outliers to remove?
# set largest data points as outliers to remove.. but could be the smallest in some data sets..
logicalvectorTifshouldremove = replace(logicalvectorTifshouldremove, tail(sort(y), toremove), TRUE)
return (logicalvectorTifshouldremove)
}
# this should have 3 data points set to TRUE .. but it has 2 and they aren't the correct ones
output = removeoutliers(y)
length(output[output==T])
I think this is essentially stated on the page you linked (not verbatim, but in two sentences):
Remove the r observations that maximize |x_i - mean(x)|
So you get the data without the outliers simply by removing the r observations with the largest absolute deviations from the mean, using:
y[abs(y-mean(y)) < sort(abs(y-mean(y)),decreasing=TRUE)[toremove]]
You do not need the last two lines of your code. By the way, you can compute directly:
toremove = max(which(newdf$TestStat. > newdf$CriticalVal.))
To simplify a bit, the final function could be:
# Compute the critical value for ESD Test
esd.critical <- function(alpha, n, i) {
p = 1 - alpha/(2*(n-i+1))
t = qt(p,(n-i-1))
return(t*(n-i) / sqrt((n-i-1+t**2)*(n-i+1)))
}
removeoutliers = function(y) {
## Define values and vectors.
y2 = y
n = length(y)
alpha = 0.05
toremove = 0
## Compute test statistic until r=10 values have been
## removed from the sample.
for (i in 1:10){
if(sd(y2)==0) break
ares = abs(y2 - mean(y2))/sd(y2)
Ri = max(ares)
y2 = y2[ares!=Ri]
## Compute critical value.
if(Ri>esd.critical(alpha,n,i))
toremove = i
}
# Values to keep
if(toremove>0)
y = y[abs(y-mean(y)) < sort(abs(y-mean(y)),decreasing=TRUE)[toremove]]
return (y)
}
which returns:
> removeoutliers(y)
[1] -0.25 0.68 0.94 1.15 1.20 1.26 1.26 1.34 1.38 1.43 1.49
[12] 1.49 1.55 1.56 1.58 1.65 1.69 1.70 1.76 1.77 1.81 1.91
[23] 1.94 1.96 1.99 2.06 2.09 2.10 2.14 2.15 2.23 2.24 2.26
[34] 2.35 2.37 2.40 2.47 2.54 2.62 2.64 2.90 2.92 2.92 2.93
[45] 3.21 3.26 3.30 3.59 3.68 4.30 4.64
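The original question asked for a logical vector marking which points to remove; that can be derived from the function above, since the keep/drop rule depends only on each value (a small sketch):
out <- removeoutliers(y)
to_remove <- !(y %in% out)  # TRUE for the flagged outliers
which(to_remove)            # here: the indices of 5.34, 5.42 and 6.01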
You can use winsorize() from the robustHD package.
library('robustHD')
set.seed(1234)
x <- rnorm(10)
x[1] <- x[1] * 10
x[2] <- x[2] * 11
x[10] <- x[10] * 10
x
[1] -12.0706575 3.0517217 1.0844412 -2.3456977 0.4291247 0.5060559 -0.5747400 -0.5466319 -0.5644520 -8.9003783
boxplot(x)
y <- winsorize(x)
y
[1] -4.5609058 3.0517217 1.0844412 -2.3456977 0.4291247 0.5060559 -0.5747400 -0.5466319 -0.5644520 -4.5609058
boxplot(y)
So if you have a data frame rather than a vector, you can use sapply() to apply winsorize() to each column.
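For example, a minimal sketch applying it column-wise (df_num is a hypothetical all-numeric data frame):
df_num <- data.frame(a = c(rnorm(9), 50), b = rnorm(10))
df_wins <- as.data.frame(sapply(df_num, winsorize))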
For more information about this package, see http://cran.r-project.org/web/packages/robustHD/index.html