Using seq_along and lapply to process multiple dataframes (CAPM) - r

I have 48 dataframes and I wish to run a linear regression (the CAPM) for each of the stocks in each dataframe. Each dataframe contains the same number of stocks (around 470) plus the S&P 500, and has 36 months' worth of data. Originally I had one large dataframe, but I have managed to split the data into the 48 dataframes (this might not have been the smartest move, but it is the way I solved the problem).
When I run the following code, it works fine. Note that I have hard-coded Block1.
beta_results <- lapply(symbols, function(x) {
temp <- as.data.frame(Block1)
input <- as.formula(paste("temp$",x, "~ temp$SP500" ))
capm <- lm(input)
coefficients(capm)
})
Now, rather than change the code for each of the 48 blocks (i.e. Block1 to Block2 etc.), I attempted the following, which in hindsight is complete rubbish. What I need is a way to increment i from 1 to 48. I had tried to put all the dataframes in a list, but given the way I have the regression working I would be processing two lists, and that was beyond me.
beta_results <- lapply(seq_along(symbols), function(i,x) {
temp <- as.data.frame(paste0("Block",i))
input <- as.formula(paste("temp$",x, "~ temp$SP500" ))
capm <- lm(input)
coefficients(capm)
})
Code for some example dataframes etc are:
symbols <- c("A", "AAPL", "BRKB")
Block1 to BlockN would take the form of
A AAPL BRKB SP500
2016-04-29 -0.139 0.111 0.122 0.150
2016-05-31 0.071 0.095 0.330 0.200
2016-06-30 -0.042 -0.009 0.230 0.150
2016-07-29 0.090 0.060 0.200 0.100
2016-08-31 0.023 0.013 0.005 0.050
2016-09-30 0.065 0.088 0.002 0.100

Consider a nested lapply where the outer loop iterates through a list of dataframes and the inner loop through each symbol. The result is a 48-member list, each member containing 470 sets of beta coefficients.
Also, as an aside, it is preferable to keep many similarly structured objects in a list, especially when running the same operations on each, and to avoid flooding your global environment (manage 1 list vs 48 dataframes):
# LIST OF DATA FRAMES FROM ALL GLOBAL VARIABLES CONTAINING "Block"
dfList <- mget(ls(pattern="Block"))
# NESTED LAPPLY
results_list <- lapply(dfList, function(df) {
beta_results <- lapply(symbols, function(x) {
input <- reformulate("SP500", response=x)
capm <- lm(input, data=df)
coefficients(capm)
})
})
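A minimal sketch of the same idea on toy data. Block1 and Block2 below are made-up stand-ins filled with random returns, and switching the inner loop to sapply so that each block yields a matrix is my own choice, not part of the answer above.

```r
# Toy stand-ins for the real blocks (hypothetical random returns)
set.seed(42)
symbols <- c("A", "AAPL", "BRKB")
make_block <- function() {
  as.data.frame(matrix(rnorm(36 * 4, sd = 0.05), ncol = 4,
                       dimnames = list(NULL, c(symbols, "SP500"))))
}
dfList <- list(Block1 = make_block(), Block2 = make_block())

# Outer loop over blocks, inner loop over symbols
results_list <- lapply(dfList, function(df) {
  sapply(symbols, function(x) {
    coefficients(lm(reformulate("SP500", response = x), data = df))
  })
})

# One 2 x 3 matrix per block: rows are intercept and slope, columns are stocks
results_list$Block1

# Betas only, one row per block and one column per stock
betas <- t(sapply(results_list, function(m) m["SP500", ]))
betas
```

Keeping the list named (Block1, Block2, ...) means the block labels carry through to the row names of betas.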

@Parfait's answer is the correct one for the OP's question of using lapply to process a list of data frames.
The following example shows how data.table can be used to get the coefficients of lm(stock~SP500) for each stock (using the Block1 example data):
library(data.table)
dt <- structure(list(date = c("2016-04-29", "2016-05-31", "2016-06-30",
"2016-07-29", "2016-08-31", "2016-09-30"), A = c(-0.139, 0.071,
-0.042, 0.09, 0.023, 0.065), AAPL = c(0.111, 0.095, -0.009, 0.06,
0.013, 0.088), BRKB = c(0.122, 0.33, 0.23, 0.2, 0.005, 0.002),
SP500 = c(0.15, 0.2, 0.15, 0.1, 0.05, 0.1)), .Names = c("date",
"A", "AAPL", "BRKB", "SP500"), row.names = c(NA, -6L), class = "data.frame")
setDT(dt)
# Convert to long format for easier lm
dt_melt <- melt(dt, id.vars = c("date", "SP500"))
# Extract coefficients by doing lm for each unique variable (i.e. stock)
dt_lm <- dt_melt[, as.list(coefficients(lm(value~SP500))), by = variable]
# Fix column names
setnames(dt_lm, c("stock", "intercept", "slope"))
> dt_lm
stock intercept slope
1: A 0.05496970 -0.3490909
2: AAPL 0.01421212 0.3636364
3: BRKB -0.10751515 2.0454545

Related

R: Select vector (numeric) from data frame, sample n=10 subsets of size i=5 and i= 10 within vector and calculate mean for each of these samples

I have the following problem:
1. I have a data frame containing two vectors, "Name" (text) and "Values" (numeric), with 20 rows and 2 columns.
2. I want to extract "Values", randomly sample (with equal weight) a subset of size 5 from it 10 times, and calculate the mean of each subset. I want to capture those 10 means in another vector (10x1).
3. I want to do the same as step 2, but with a larger subset of 15 of the 20 values: take those 15 values, calculate the mean, repeat this 10 times, and log the results in a new 10x1 vector.
4. Ultimately, I want to compare some descriptive statistics between these two vectors, expecting that the vector from the smaller subset size would have fatter tails, be more negatively skewed, etc.
Creating the data frame as a start
Name <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t")
Values <- c(0.1, 0.05, 0.03, 0.06, -0.1, -0.3, -0.05, 0.5, 0.12, 0.06, 0.04, 0.15, 0.13, 0.16, -0.12, -0.03, -0.5, 0.05, 0.07, 0.03)
data <- data.frame(Name, Values)
The relevant part:
# extract Values column
Values <- data$Values
# define sizes of subset and number of iterations
n_small <- 5
n_large <- 15
n_iterations <- 10
set.seed(123456)
# Initialize result vector
Averages_small <- NULL
Averages_large <- NULL
# Calculate average of the subset and allocate it to the result vector
for (i in n_iterations) {
Averages_small[i] <- mean(sample(Values, n_small, replace = FALSE))
Averages_large[i] <- mean(sample(Values, n_large, replace = FALSE))
}
Somehow this gives me 9 NAs and one number. What am I doing wrong? And is there a better way than a for loop? The above is only an example without NA values; the original data set has 20k rows and might contain missing values.
FYI, to give you some background: the Values are investment returns, and the question is whether having a higher number of investments helps diversification.
Thank you very much for your help!
You can use replicate to get 10 draws of your sample. This returns a matrix with the samples in columns, so the colMeans of this matrix gives you the vector you are looking for:
set.seed(1) # For reproducibility
vec5 <- colMeans(replicate(10, sample(data$Values, 5)))
vec15 <- colMeans(replicate(10, sample(data$Values, 15)))
vec5
#> [1] -0.014 0.148 0.044 -0.026 0.062 0.020 -0.032 -0.130 0.166 0.040
vec15
#> [1] 0.058000000 0.024666667 0.051333333 0.045333333 0.024000000
#> [6] 0.010666667 0.022666667 -0.010000000 0.003333333 -0.001333333
You can see that the standard deviation of vec5 is indeed larger:
sd(vec5)
#> [1] 0.08711908
sd(vec15)
#> [1] 0.02297406
I know that this question has already been answered, but I have found the mistake in your original code that caused it to not work.
The code as you wrote it can actually work as you want it to, but the for loop only fired once; for (i in v) loops over a vector, repeating with each value listed. Remember that you set
n_iterations <- 10
So in your loop, you effectively had for (i in 10), such that the loop was only called once, meaning that the whole structure ended up being
Averages_small[10] <- mean(sample(Values, n_small, replace = FALSE))
Averages_large[10] <- mean(sample(Values, n_large, replace = FALSE))
What you want is for (i in 1:10), which loops over a vector of ten values. This can be solved either by defining n_iterations <- 1:10, or (using your original setup)
set.seed(123456)
for (i in 1:n_iterations) {
Averages_small[i] <- mean(sample(Values, n_small, replace = FALSE))
Averages_large[i] <- mean(sample(Values, n_large, replace = FALSE))
}
Averages_small
#> [1] -0.066 0.042 0.036 0.018 0.080 0.016 -0.038 -0.180 0.132 0.042
Averages_large
#> [1] -0.02600000 -0.01266667 0.02000000 0.04666667 0.03533333 -0.02200000 -0.01533333 -0.00400000 0.03266667 0.07333333
I know that for loops are generally not optimal, and a solution that does not rely on one is probably superior, but I also thought that you would appreciate an explanation of why your code did not function correctly in the first place.
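On the missing-values part of the question, which the answers above do not cover: one option is to pass na.rm = TRUE to mean, so that a sampled NA is simply left out of that draw's average. A minimal sketch; the helper name sample_means and the NAs planted in the toy vector are mine, for illustration only.

```r
Values <- c(0.1, 0.05, 0.03, 0.06, -0.1, -0.3, -0.05, 0.5, 0.12, 0.06,
            0.04, 0.15, 0.13, 0.16, -0.12, -0.03, -0.5, 0.05, 0.07, 0.03)
Values[c(3, 17)] <- NA  # planted NAs, only to exercise na.rm

# Hypothetical helper: n_iterations means of random subsets of size n
sample_means <- function(x, n, n_iterations = 10) {
  vapply(seq_len(n_iterations),
         function(i) mean(sample(x, n, replace = FALSE), na.rm = TRUE),
         numeric(1))
}

set.seed(123456)
Averages_small <- sample_means(Values, 5)
Averages_large <- sample_means(Values, 15)
```

With this approach a draw that happens to include NAs averages fewer than n values. If every subset must contain exactly n observed values, dropping the NAs first (x[!is.na(x)]) before sampling would be the alternative.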

R: Solving for a variable (using the uniroot function)

I am rather new to R and could really use the community's help with the following problem. I am trying to solve for the variable r in the following equation: ((EPS2 + r*DPS1 - EPS1)/r^2) - PRC = 0. Here is my (unsuccessful) attempt at solving the problem (using the uniroot function):
EPS2 = df_final$EPS2
DPS1 = df_final$DPS1
EPS1 = df_final$EPS1
PRC = df_final$PRC
f1 = function(r) {
((df_final_test$EPS2 + r * df_final_test$DPS1-df_final_test$EPS1)/r^2)-df_final_test$PRC
}
uniroot(f1,interval = c(1e-8,100000),EPS2, DPS1, EPS1, PRC , extendInt="downX")$root
I then get the following error: Error in f(lower, ...) : unused arguments (c(" 1.39", " 1.39", ...
I am grateful for any tips and hints you guys could give me in regard to this problem. Or whether a different function/package would be better in this case.
Added a reprex (?) in case that helps anybody in helping me with this issue:
df <- structure(list(EPS1 = c(6.53, 1.32, 1.39, 1.71, 2.13), DPS1 = c(2.53, 0.63,
0.81, 1.08, 1.33, 19.8), EPS2 = c(7.57,1.39,1.43,1.85,2.49), PRC = c(19.01,38.27,44.82,35.27,47.12)), .Names = c("EPS1", "DPS1", "EPS2", "PRC"), row.names = c(NA,
-5L), class = "data.frame")
I don't think you can use uniroot if all coefficients are vectors rather than scalars. In this case, a straightforward approach is solving them in a math way, i.e.,
r1 <- (DPS1 + sqrt(DPS1^2-4*PRC*(EPS1-EPS2)))/(2*PRC)
and
r2 <- (DPS1 - sqrt(DPS1^2-4*PRC*(EPS1-EPS2)))/(2*PRC)
where r1 and r2 are two roots.
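To see where these come from: setting (EPS2 + r*DPS1 - EPS1)/r^2 - PRC to zero and multiplying through by r^2 gives PRC*r^2 - DPS1*r + (EPS1 - EPS2) = 0, an ordinary quadratic in r. A minimal check on the question's example rows, assuming the stray sixth DPS1 value (19.8) is dropped so that every column has five entries:

```r
df <- data.frame(
  EPS1 = c(6.53, 1.32, 1.39, 1.71, 2.13),
  DPS1 = c(2.53, 0.63, 0.81, 1.08, 1.33),
  EPS2 = c(7.57, 1.39, 1.43, 1.85, 2.49),
  PRC  = c(19.01, 38.27, 44.82, 35.27, 47.12)
)

# Discriminant of PRC * r^2 - DPS1 * r + (EPS1 - EPS2) = 0, row by row
disc <- with(df, DPS1^2 - 4 * PRC * (EPS1 - EPS2))
r1 <- with(df, (DPS1 + sqrt(disc)) / (2 * PRC))
r2 <- with(df, (DPS1 - sqrt(disc)) / (2 * PRC))

# Plugging the roots back into the original expression should give ~0
f <- function(r) with(df, (EPS2 + r * DPS1 - EPS1) / r^2 - PRC)
max(abs(f(r1)), abs(f(r2)))
```

In this data EPS2 exceeds EPS1 in every row, so the discriminant is positive, r1 is the positive root and r2 the negative one. For other data a negative discriminant would make sqrt return NaN, meaning there is no real root for that row.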
Disclaimer: I have no experience with uniroot() and have no idea whether the following makes sense, but it runs! The idea is basically to call uniroot for each row of the data frame.
Note that I modified the function f1 slightly so that each of the additional parameters is passed as an argument of the function rather than relying on finding objects with the same name in the parent environment. I also use with to avoid writing df$... for every variable.
library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 4.1.0
library(furrr)
#> Loading required package: future
df <- structure(list(EPS1 = c(6.53, 1.32, 1.39, 1.71, 2.13),
DPS1 = c(2.53, 0.63, 0.81, 1.08, 1.33, 19.8),
EPS2 = c(7.57,1.39,1.43,1.85,2.49),
PRC = c(19.01,38.27,44.82,35.27,47.12)),
.Names = c("EPS1", "DPS1", "EPS2", "PRC"),
row.names = c(NA,-5L), class = "data.frame")
df
#> Warning in format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x, :
#> corrupt data frame: columns will be truncated or padded with NAs
#> EPS1 DPS1 EPS2 PRC
#> 1 6.53 2.53 7.57 19.01
#> 2 1.32 0.63 1.39 38.27
#> 3 1.39 0.81 1.43 44.82
#> 4 1.71 1.08 1.85 35.27
#> 5 2.13 1.33 2.49 47.12
f1 = function(r, EPS2, DPS1, EPS1, PRC) {
(( EPS2 + r * DPS1 - EPS1)/r^2) - PRC
}
# try for first row
with(df,
uniroot(f1,
EPS2=EPS2[1], DPS1=DPS1[1], EPS1=EPS1[1], PRC=PRC[1],
interval = c(1e-8,100000),
extendInt="downX")$root)
#> [1] 0.3097291
# it runs!
# loop over each row
vec_sols <- rep(NA, nrow(df))
for (i in seq_len(nrow(df))) {
sol <- with(df, uniroot(f1,
EPS2=EPS2[i], DPS1=DPS1[i], EPS1=EPS1[i], PRC=PRC[i],
interval = c(1e-8,100000),
extendInt="downX")$root)
vec_sols[i] <- sol
}
vec_sols
#> [1] 0.30972906 0.05177443 0.04022946 0.08015686 0.10265226
# Alternatively, you can use furrr's future_map_dbl to use multiple cores.
# the following will basically do the same as the above loop.
# here with 4 cores.
plan(multisession, workers = 4)
vec_sols <- 1:nrow(df) %>% furrr::future_map_dbl(
.f = ~with(df,
uniroot(f1,
EPS2=EPS2[.x], DPS1=DPS1[.x], EPS1=EPS1[.x], PRC=PRC[.x],
interval = c(1e-8,100000),
extendInt="downX")$root
))
vec_sols
#> [1] 0.30972906 0.05177443 0.04022946 0.08015686 0.10265226
# then apply the solutions back to the dataframe (each row to each solution)
df %>% mutate(
root = vec_sols
)
#> Warning in format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x, :
#> corrupt data frame: columns will be truncated or padded with NAs
#> EPS1 DPS1 EPS2 PRC root
#> 1 6.53 2.53 7.57 19.01 0.30972906
#> 2 1.32 0.63 1.39 38.27 0.05177443
#> 3 1.39 0.81 1.43 44.82 0.04022946
#> 4 1.71 1.08 1.85 35.27 0.08015686
#> 5 2.13 1.33 2.49 47.12 0.10265226
Created on 2021-06-20 by the reprex package (v2.0.0)
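The row-wise loop can also be written with mapply from base R, which walks the four columns in parallel and needs no extra packages. A minimal sketch, again assuming DPS1 is trimmed to its first five values; find_root is a hypothetical helper name.

```r
df <- data.frame(
  EPS1 = c(6.53, 1.32, 1.39, 1.71, 2.13),
  DPS1 = c(2.53, 0.63, 0.81, 1.08, 1.33),
  EPS2 = c(7.57, 1.39, 1.43, 1.85, 2.49),
  PRC  = c(19.01, 38.27, 44.82, 35.27, 47.12)
)

f1 <- function(r, EPS2, DPS1, EPS1, PRC) {
  ((EPS2 + r * DPS1 - EPS1) / r^2) - PRC
}

# Hypothetical helper: solve f1(r) = 0 for one row's scalar inputs
find_root <- function(EPS2, DPS1, EPS1, PRC) {
  uniroot(f1, interval = c(1e-8, 100000),
          EPS2 = EPS2, DPS1 = DPS1, EPS1 = EPS1, PRC = PRC,
          extendInt = "downX")$root
}

# mapply passes the i-th element of each column to find_root
df$root <- mapply(find_root, df$EPS2, df$DPS1, df$EPS1, df$PRC)
df
```

Because uniroot stops at its default tolerance, the roots agree with the closed-form quadratic solution only to about four decimal places; passing a smaller tol tightens that.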

Two-step cluster

I have a dataset in this format:
structure(list(id = 1:4, var1_before = c(-0.16, -0.31, -0.26,
-0.77), var2_before = c(-0.7, -1.06, -0.51, -0.81), var3_before = c(2.47,
2.97, 2.91, 3.01), var4_before = c(-1.08, -1.22, -0.92, -1.16
), var5_before = c(0.54, 0.4, 0.46, 0.79), var1_after = c(-0.43,
-0.18, -0.59, 0.64), var2_after = c(-0.69, -0.38, -1.19, -0.77
), var3_after = c(2.97, 3.15, 3.35, 1.52), var4_after = c(-1.11,
-0.99, -1.26, -0.39), var5_after = c(1.22, 0.41, 1.01, 0.24)), class = "data.frame", row.names = c(NA,
-4L))
Every id is unique.
I would like to make two clusters:
First cluster for variables: var1_before, var2_before, var3_before, var4_before, var5_before
Second cluster for variables: var1_after, var2_after, var3_after, var4_after, var5_after
I used two-step cluster in spss for this.
How is it possible to make it in R?
This question is quite complex. This is how I'd approach the problem, hoping to help and maybe to start a discussion about it.
Note:
this is how I think I could approach the problem;
I do not know the two-step clustering, so I use a kmeans;
it's based on your data, but you can easily generalize it: I've made it dependent to your data because it's simpler to explain.
So, you create the first clustering with the before variables, then the values of the variables change (after variables), and you want to see whether the ids are still in the same cluster.
This leads me to think that you only need the first set of clusters (for the before variables) and then a check of whether the ids have moved: there is no need to do a second clustering, only to see whether they've changed from one cluster to another.
# first, you make your model of clustering, I'm using a simple kmeans
set.seed(1234)
model <- kmeans(df[,2:6],2)
# you put the clusters in the dataset
df$before_cluster <- model$cluster
Now the idea is to calculate the Euclidean distance from each id's new values (after variables) to the centroids calculated on the before variables:
# for the first cluster
cl1 <- list()
for (i in 1:nrow(df)) {
cl1[[i]] <- dist(rbind(df[i,7:11], model$centers[1,] ))
}
cl1 <- do.call(rbind, cl1)
colnames(cl1) <- 'first'
# for the second cluster
cl2 <- list()
for (i in 1:nrow(df)) {
cl2[[i]] <- dist(rbind(df[i,7:11], model$centers[2,] ))
}
cl2 <- do.call(rbind, cl2)
colnames(cl2) <- 'second'
# put them together
df <- cbind(df, cl1, cl2)
Now the last part: you can determine whether an id has changed cluster by taking the smallest distance to the centroids (the nearest centroid is the "new" cluster).
df$new_cl <- ifelse(df$first < df$second, 1,2)
df
id var1_before var2_before var3_before var4_before var5_before var1_after var2_after var3_after var4_after var5_after first second before_cluster first second new_cl
1 1 -0.16 -0.70 2.47 -1.08 0.54 -0.43 -0.69 2.97 -1.11 1.22 0.6852372 0.8151840 2 0.6852372 0.8151840 1
2 2 -0.31 -1.06 2.97 -1.22 0.40 -0.18 -0.38 3.15 -0.99 0.41 0.7331098 0.5208887 1 0.7331098 0.5208887 2
3 3 -0.26 -0.51 2.91 -0.92 0.46 -0.59 -1.19 3.35 -1.26 1.01 0.6117598 1.1180004 2 0.6117598 1.1180004 1
4 4 -0.77 -0.81 3.01 -1.16 0.79 0.64 -0.77 1.52 -0.39 0.24 2.0848381 1.5994765 1 2.0848381 1.5994765 2
Seems they all have changed cluster.
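The two distance loops can be collapsed by subtracting each centroid from all rows at once with sweep. A minimal sketch on the question's data, assuming the before and after columns are in matching order; the names before, after, d and new_cl are mine.

```r
df <- data.frame(
  id = 1:4,
  var1_before = c(-0.16, -0.31, -0.26, -0.77),
  var2_before = c(-0.7, -1.06, -0.51, -0.81),
  var3_before = c(2.47, 2.97, 2.91, 3.01),
  var4_before = c(-1.08, -1.22, -0.92, -1.16),
  var5_before = c(0.54, 0.4, 0.46, 0.79),
  var1_after = c(-0.43, -0.18, -0.59, 0.64),
  var2_after = c(-0.69, -0.38, -1.19, -0.77),
  var3_after = c(2.97, 3.15, 3.35, 1.52),
  var4_after = c(-1.11, -0.99, -1.26, -0.39),
  var5_after = c(1.22, 0.41, 1.01, 0.24)
)
before <- as.matrix(df[, 2:6])
after  <- as.matrix(df[, 7:11])

set.seed(1234)
model <- kmeans(before, 2)

# Euclidean distance from every "after" row to each "before" centroid:
# one column of d per centroid
d <- sapply(seq_len(nrow(model$centers)), function(k) {
  sqrt(rowSums(sweep(after, 2, model$centers[k, ])^2))
})

# Nearest centroid for each id after the change
new_cl <- apply(d, 1, which.min)
data.frame(id = df$id, before_cluster = model$cluster, new_cl = new_cl)
```

The same code works unchanged for more than two clusters, since d simply gains one column per centroid.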

How to pass list elements to model in R?

I am quite new to the use of lists so I apologize if this problem may sound very dumb.
From an original set of 459,046 customers, I have created a function that splits and stores the base in several elements of a list.
sampled_list <- baseSample(dataset = clv_df_cbs, sample.size = 10000, seed = 12345)
Executing this function (baseSample) returns a new list containing mutually exclusive groups of customers (each group is made up of 10,000 customers, apart from the last one, which may be smaller depending on the initial volume).
> sampled_list <- baseSample(dataset = clv_df_cbs, sample.size = 10000, seed = 12345)
[1] "Seed: 12345"
[1] "Total groups created: 46"
[1] "Group size: 10000"
In this case, the output is a list of 46 elements stored in the object called sampled_list.
Now, I want to pass each of these 46 elements to a BTYD model that will forecast the number of transactions in the next 90 days (given the learnings from the input).
The reason I cannot pass the full dataset to the BTYD model is that the model relies heavily on MCMC, so the calculation takes so long that the model never produces any output. I have therefore decided to generate forecasts by running the same model several times (on samples that are big enough) until I have passed the whole base through as model input.
The operations that need to be performed on each of the elements are the following
# Estimate parameters for element1 of the list
pggg.draws1 <- pggg.mcmc.DrawParameters(element1,
mcmc = 1000, # number of MCMC steps
burnin = 250, # number of initial MCMC steps which are discarded
thin = 10, # only every thin-th MCMC step will be returned
chains = 2, # number of MCMC chains to be run
trace = 50) # print logging step every trace iteration
# generate draws for holdout period
pggg.xstar.draws1 <- mcmc.DrawFutureTransactions(element1, pggg.draws1)
# conditional expectations
element1$xstar.pggg <- apply(pggg.xstar.draws1, 2, mean)
# P(active)
element1$pactive.pggg <- mcmc.PActive(pggg.xstar.draws1)
# P(alive)
element1$palive.pggg <- mcmc.PAlive(pggg.draws1)
# show estimates for first few customers
head(element1[, c("x", "t.x", "x.star",
"xstar.pggg", "pactive.pggg", "palive.pggg")],50)
# report median cohort-level parameter estimates
round(apply(as.matrix(pggg.draws1$level_2), 2, median), 3)
# report mean over median individual-level parameter estimates
median.est1 <- sapply(pggg.draws1$level_1, function(draw) {
apply(as.matrix(draw), 2, median)
})
round(apply(median.est1, 1, mean), 3)
Ideally, the output should be stored straight into a new data.frame - so I can retrieve the Id and the forecast (amongst other stuff originally included in the dataset).
Here below some mock data to play with from a publicly available dataset.
library(BTYDplus)
library(tidyverse)
data("groceryElog")
dataset<-elog2cbs(groceryElog, T.cal = "2006-12-01")
# FUNCTION baseSample ####
baseSample <- function(dataset, sample.size, seed=NULL) {
seed.value <- if(is.null(seed)) {
as.numeric(format(Sys.Date(),"%Y"))*10000+as.numeric(format(Sys.Date(),"%m"))*100+as.numeric(format(Sys.Date(),"%d"))
} else {
seed
}
set.seed(seed.value)
# RE-ORDER DATA FRAME (SAME LENGTH)
data <- with(dataset, dataset[order(sample(cust, nrow(dataset))),])
# BUILD A LIST OF DFs
set.sample.size <- sample.size
data$cycles_group <- paste0("sample_", ceiling(1:nrow(data)/set.sample.size))
df_list <- split(data, data$cycles_group)
print(paste0("Seed: ", seed.value))
print(paste0("Total groups created: ", length(unique(data$cycles_group))))
print(paste0("Group size: ", set.sample.size))
return(df_list)
#print(df_list)
}
# ** OUTPUT: Base split in lists ####
sampled_list <- baseSample(dataset = dataset, sample.size = 100, seed = 12345)
Thanks
In base R, you can use lapply to iterate a function over the elements of a list and return a new list with the results of those iterations. After using your example code to generate a list called sampled_list...
# turn the code for the operations you want to perform on each list element into a function,
# with a couple of minor tweaks
thingy <- function(i) {
# Estimate parameters for element1 of the list
pggg.draws1 <- pggg.mcmc.DrawParameters(i,
mcmc = 1000, # number of MCMC steps
burnin = 250, # number of initial MCMC steps which are discarded
thin = 10, # only every thin-th MCMC step will be returned
chains = 2, # number of MCMC chains to be run
trace = 50) # print logging step every trace iteration
# generate draws for holdout period
pggg.xstar.draws1 <- mcmc.DrawFutureTransactions(i, pggg.draws1)
# conditional expectations
i$xstar.pggg <- apply(pggg.xstar.draws1, 2, mean)
# P(active)
i$pactive.pggg <- mcmc.PActive(pggg.xstar.draws1)
# P(alive)
i$palive.pggg <- mcmc.PAlive(pggg.draws1)
# show estimates for first few customers [commenting out for this iterated version]
# head(element1[, c("x", "t.x", "x.star", "xstar.pggg", "pactive.pggg", "palive.pggg")],50)
# report median cohort-level parameter estimates
round(apply(as.matrix(pggg.draws1$level_2), 2, median), 3)
# report mean over median individual-level parameter estimates
median.est1 <- sapply(pggg.draws1$level_1, function(draw) {
apply(as.matrix(draw), 2, median)
})
# get the bits you want in a named vector
z <- round(apply(median.est1, 1, mean), 3)
# convert that named vector of results into a one-row data frame to make collapsing easier
data.frame(as.list(z))
}
# now use lapply to iterate that function over the elements of your list
results <- lapply(sampled_list, thingy)
# now bind the results into a data frame
boundresults <- do.call(rbind, results)
Results (which took a while to get):
k lambda mu tau z
sample_1 4.200 0.174 0.091 102.835 0.27
sample_10 3.117 0.149 0.214 128.143 0.29
sample_11 4.093 0.154 0.115 130.802 0.30
sample_12 4.191 0.142 0.053 114.108 0.33
sample_13 2.605 0.155 0.071 160.743 0.35
sample_14 9.196 0.210 0.084 111.747 0.36
sample_15 2.005 0.145 0.091 298.872 0.40
sample_16 2.454 0.111 0.019 78731750.121 0.70
sample_2 2.808 0.138 0.059 812.278 0.40
sample_3 4.327 0.166 0.116 559.318 0.42
sample_4 9.266 0.166 0.038 146.283 0.40
sample_5 3.277 0.157 0.073 105.915 0.33
sample_6 9.584 0.184 0.086 118.299 0.31
sample_7 4.244 0.189 0.118 54.945 0.23
sample_8 4.388 0.147 0.085 325.054 0.36
sample_9 7.898 0.181 0.052 83.892 0.33
You can also combine those last two steps into a single line of do.call(rbind, lapply(...)). If you want to make the row names in the results table into a column, you could do boundresults$sample <- row.names(boundresults) after making that table. And if you don't like creating new objects in your environment, you could put that function inside the call to lapply, i.e., lapply(sampled_list, function(i) { [your code] }).
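The same iterate-then-bind pattern on a toy list that runs instantly, in case it helps to test the plumbing before paying for the MCMC. A minimal sketch; toy_list and summarise_group are made-up stand-ins for sampled_list and thingy, and the summary statistics are arbitrary.

```r
# Made-up stand-in for sampled_list: three small data frames
set.seed(12345)
toy_list <- split(data.frame(cust = 1:30, x = rpois(30, 3)),
                  paste0("sample_", rep(1:3, each = 10)))

# Stand-in for the model step: return a one-row data frame per element
summarise_group <- function(d) {
  data.frame(n = nrow(d), mean_x = mean(d$x), max_x = max(d$x))
}

# Iterate and bind in one step, then move the row names into a column
boundresults <- do.call(rbind, lapply(toy_list, summarise_group))
boundresults$sample <- row.names(boundresults)
boundresults
```

If the per-customer forecasts are wanted rather than the parameter summaries, one option is to have the function return the augmented data frame itself (i in thingy, after the xstar.pggg, pactive.pggg and palive.pggg columns are added); do.call(rbind, ...) would then stack the groups back into a single data frame.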

Regression in R using vectorization and matrices

I have a vectorization question in R using matrices. I have 2 columns that need to be regressed against each other within groups defined by certain indices. The data is
matrix_senttoR = [ ...
0.11 0.95
0.23 0.34
0.67 0.54
0.65 0.95
0.12 0.54
0.45 0.43 ] ;
indices_forR = [ ...
1
1
1
2
2
2 ] ;
Col1 in matrix is data for say MSFT and GOOG (3 rows each) and Col2 is the return from benchmark StkIndex, on corresponding dates. The data is in matrix format as it is sent from Matlab.
I currently use
slope <- by( data.frame(matrix_senttoR), indices_forR, FUN=function(x)
{zyp.sen (X1~X2,data=x) $coeff[2] } )
betasFac <- sapply(slope , function(x) x+0)
I'm using data.frame above as I could not use cbind(): if I use cbind(), Matlab gives an error because it doesn't understand that data format. I'm running these commands from inside Matlab (http://www.mathworks.com/matlabcentral/fileexchange/5051). You can replace zyp (zyp.sen) with lm.
by() is slow here (maybe because of the data frames?). Is there a better way to do it? It takes 14+ seconds for 150k rows of data. Can I use matrix vectorization in R instead? Thanks.
This could easily be moved to a comment, but:
A few things to consider, I tend to avoid the by() function since its return value is a funky object. Instead, why not add your indices_forR vector to the data.frame?
df <- data.frame(matrix_senttoR)
df$indices_forR <- indices_forR
the plyr package does the work from here:
ddply(df,.(indices_forR),function(x) zyp.sen(X1~X2,data=x)$coeff[2])
You can easily multi-thread this operation using doMC or doSNOW and the argument .parallel=TRUE to ddply.
If speed is the goal, I would also learn the data.table package (which extends data.frame and is much faster). Also, I assume that the slow piece is the zyp.sen() call rather than the by() call. Executing on multiple cores will speed this along.
> dput(df)
structure(list(X1 = c(0.11, 0.23, 0.67, 0.65, 0.12, 0.45), X2 = c(0.95,
0.34, 0.54, 0.95, 0.54, 0.43), indices_forR = c(1, 1, 1, 2, 2,
2)), .Names = c("X1", "X2", "indices_forR"), row.names = c(NA,
-6L), class = "data.frame")
> ddply(df,.(indices_forR),function(x) lm(X1~X2,data=x)$coeff[2])
indices_forR X2
1 1 -0.3702172
2 2 0.6324900
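If an ordinary least-squares slope is acceptable in place of zyp.sen, the per-group slope has the closed form cov(X1, X2) / var(X2), which avoids building a data frame and calling a model function for every group. A minimal sketch on the example data, working directly on the matrix; this is plain OLS, not the Theil-Sen estimate that zyp.sen computes.

```r
matrix_senttoR <- matrix(c(0.11, 0.95,
                           0.23, 0.34,
                           0.67, 0.54,
                           0.65, 0.95,
                           0.12, 0.54,
                           0.45, 0.43), ncol = 2, byrow = TRUE)
indices_forR <- c(1, 1, 1, 2, 2, 2)

# Row numbers belonging to each group
rows_by_group <- split(seq_len(nrow(matrix_senttoR)), indices_forR)

# OLS slope of column 1 on column 2 within each group
betasFac <- sapply(rows_by_group, function(r) {
  cov(matrix_senttoR[r, 1], matrix_senttoR[r, 2]) / var(matrix_senttoR[r, 2])
})
betasFac
```

The two slopes should match the lm-based ddply output above; whether this is fast enough for 150k rows would need timing on the real data.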
I still think that you are overcomplicating things by moving from MATLAB to R and back. And passing 150k rows of data must be slowing things down considerably.
zyp.sen is actually pretty trivial to port to MATLAB. Here you go:
function [intercept, slope, intercepts, slopes, rank, residuals] = ZypSen(x, y)
% Computes a Theil-Sen estimate of slope for a vector of data.
n = length(x);
slopes = arrayfun(@(i) ZypSlopediff(i, x, y, n), 1:(n - 1), ...
'UniformOutput', false);
slopes = [slopes{:}];
sni = isfinite(slopes);
slope = median(slopes(sni));
intercepts = y - slope * x;
intercept = median(intercepts);
rank = 2;
residuals = y - (slope * x + intercept); % observed minus fitted
end
function z = ZypSlopediff(i, x, y, n)
z = (y(1:(n - i)) - y((i + 1):n)) ./ ...
(x(1:(n - i)) - x((i + 1):n));
end
I checked this using R's example(zyp.sen), and it gives the same answer.
x = [0 1 2 4 5]
y = [6 4 1 8 7]
[int, sl, ints, sls, ra, res] = ZypSen(x, y)
You should really do some further checking though, just to be sure.
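For a cross-check that does not depend on either implementation, the Theil-Sen slope is just the median of all pairwise slopes, which is short to write in base R. A minimal sketch on the same x and y; theil_sen is a hypothetical helper, not the zyp package's code, and it takes the intercept as the median of y - slope * x, as the MATLAB port above does.

```r
theil_sen <- function(x, y) {
  pairs <- combn(length(x), 2)                 # all index pairs i < j
  slopes <- (y[pairs[2, ]] - y[pairs[1, ]]) /
            (x[pairs[2, ]] - x[pairs[1, ]])
  slope <- median(slopes[is.finite(slopes)])   # drop pairs with equal x
  intercept <- median(y - slope * x)
  c(intercept = intercept, slope = slope)
}

x <- c(0, 1, 2, 4, 5)
y <- c(6, 4, 1, 8, 7)
theil_sen(x, y)
```

For this data the ten pairwise slopes have median 0.35, and the intercept comes out at 5.25, which gives a concrete pair of numbers to compare against the MATLAB output.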
