I am quite new to the use of lists, so I apologize if this problem sounds very dumb.
From an original set of 459,046 customers, I have created a function that splits the base and stores it in several elements of a list.
sampled_list <- baseSample(dataset = clv_df_cbs, sample.size = 10000, seed = 12345)
Executing this function (baseSample) returns a new list object containing mutually exclusive groups of customers (each group is made of 10,000 customers, apart from the last one, which may be smaller depending on the initial volume):
> sampled_list <- baseSample(dataset = clv_df_cbs, sample.size = 10000, seed = 12345)
[1] "Seed: 12345"
[1] "Total groups created: 46"
[1] "Group size: 10000"
In this case, the output is a list of 46 elements stored in the object called sampled_list.
Now, I want to pass each of these 46 elements to a BTYD model that will forecast the number of transactions in the next 90 days (given the learnings from the input).
The reason why I cannot pass the full dataset to the BTYD model is that this model relies heavily on MCMC, so the computation takes so long that the model fails to provide any output. I have therefore decided to generate forecasts by running the same model several times (on samples that are big enough) until the whole base has been passed through as model input.
The operations that need to be performed on each of the elements are the following:
# Estimate parameters for element1 of the list
pggg.draws1 <- pggg.mcmc.DrawParameters(element1,
                                        mcmc = 1000,  # number of MCMC steps
                                        burnin = 250, # number of initial MCMC steps which are discarded
                                        thin = 10,    # only every thin-th MCMC step will be returned
                                        chains = 2,   # number of MCMC chains to be run
                                        trace = 50)   # print logging step every trace iteration
# generate draws for holdout period
pggg.xstar.draws1 <- mcmc.DrawFutureTransactions(element1, pggg.draws1)
# conditional expectations
element1$xstar.pggg <- apply(pggg.xstar.draws1, 2, mean)
# P(active)
element1$pactive.pggg <- mcmc.PActive(pggg.xstar.draws1)
# P(alive)
element1$palive.pggg <- mcmc.PAlive(pggg.draws1)
# show estimates for first few customers
head(element1[, c("x", "t.x", "x.star",
"xstar.pggg", "pactive.pggg", "palive.pggg")],50)
# report median cohort-level parameter estimates
round(apply(as.matrix(pggg.draws1$level_2), 2, median), 3)
# report mean over median individual-level parameter estimates
median.est1 <- sapply(pggg.draws1$level_1, function(draw) {
  apply(as.matrix(draw), 2, median)
})
round(apply(median.est1, 1, mean), 3)
Ideally, the output should be stored straight into a new data.frame - so I can retrieve the Id and the forecast (amongst other stuff originally included in the dataset).
Below is some mock data to play with, from a publicly available dataset.
library(BTYDplus)
library(tidyverse)
data("groceryElog")
dataset<-elog2cbs(groceryElog, T.cal = "2006-12-01")
# FUNCTION baseSample ####
baseSample <- function(dataset, sample.size, seed = NULL) {
  seed.value <- if (is.null(seed)) {
    as.numeric(format(Sys.Date(), "%Y")) * 10000 +
      as.numeric(format(Sys.Date(), "%m")) * 100 +
      as.numeric(format(Sys.Date(), "%d"))
  } else {
    seed
  }
  set.seed(seed.value)
  # RE-ORDER DATA FRAME (SAME LENGTH)
  data <- with(dataset, dataset[order(sample(cust, nrow(dataset))), ])
  # BUILD A LIST OF DFs
  set.sample.size <- sample.size
  data$cycles_group <- paste0("sample_", ceiling(1:nrow(data) / set.sample.size))
  df_list <- split(data, data$cycles_group)
  print(paste0("Seed: ", seed.value))
  print(paste0("Total groups created: ", length(unique(data$cycles_group))))
  print(paste0("Group size: ", set.sample.size))
  return(df_list)
  #print(df_list)
}
# ** OUTPUT: Base split in lists ####
sampled_list <- baseSample(dataset = dataset, sample.size = 100, seed = 12345)
Thanks
In base R, you can use lapply to iterate a function over the elements of a list and return a new list with the results of those iterations. After using your example code to generate a list called sampled_list...
# turn the code for the operations you want to perform on each list element into a function,
# with a couple of minor tweaks
thingy <- function(i) {
  # Estimate parameters for this element of the list
  pggg.draws1 <- pggg.mcmc.DrawParameters(i,
                                          mcmc = 1000,  # number of MCMC steps
                                          burnin = 250, # number of initial MCMC steps which are discarded
                                          thin = 10,    # only every thin-th MCMC step will be returned
                                          chains = 2,   # number of MCMC chains to be run
                                          trace = 50)   # print logging step every trace iteration
  # generate draws for holdout period
  pggg.xstar.draws1 <- mcmc.DrawFutureTransactions(i, pggg.draws1)
  # conditional expectations
  i$xstar.pggg <- apply(pggg.xstar.draws1, 2, mean)
  # P(active)
  i$pactive.pggg <- mcmc.PActive(pggg.xstar.draws1)
  # P(alive)
  i$palive.pggg <- mcmc.PAlive(pggg.draws1)
  # show estimates for first few customers [commented out for this iterated version]
  # head(i[, c("x", "t.x", "x.star", "xstar.pggg", "pactive.pggg", "palive.pggg")], 50)
  # report median cohort-level parameter estimates
  round(apply(as.matrix(pggg.draws1$level_2), 2, median), 3)
  # report mean over median individual-level parameter estimates
  median.est1 <- sapply(pggg.draws1$level_1, function(draw) {
    apply(as.matrix(draw), 2, median)
  })
  # get the bits you want in a named vector
  z <- round(apply(median.est1, 1, mean), 3)
  # convert that named vector of results into a one-row data frame to make collapsing easier
  data.frame(as.list(z))
}
# now use lapply to iterate that function over the elements of your list
results <- lapply(sampled_list, thingy)
# now bind the results into a data frame
boundresults <- do.call(rbind, results)
Results (which took a while to get):
k lambda mu tau z
sample_1 4.200 0.174 0.091 102.835 0.27
sample_10 3.117 0.149 0.214 128.143 0.29
sample_11 4.093 0.154 0.115 130.802 0.30
sample_12 4.191 0.142 0.053 114.108 0.33
sample_13 2.605 0.155 0.071 160.743 0.35
sample_14 9.196 0.210 0.084 111.747 0.36
sample_15 2.005 0.145 0.091 298.872 0.40
sample_16 2.454 0.111 0.019 78731750.121 0.70
sample_2 2.808 0.138 0.059 812.278 0.40
sample_3 4.327 0.166 0.116 559.318 0.42
sample_4 9.266 0.166 0.038 146.283 0.40
sample_5 3.277 0.157 0.073 105.915 0.33
sample_6 9.584 0.184 0.086 118.299 0.31
sample_7 4.244 0.189 0.118 54.945 0.23
sample_8 4.388 0.147 0.085 325.054 0.36
sample_9 7.898 0.181 0.052 83.892 0.33
You can also combine those last two steps into a single line of do.call(rbind, lapply(...)). If you want to make the row names in the results table into a column, you could do boundresults$sample <- row.names(boundresults) after making that table. And if you don't like creating new objects in your environment, you could put that function inside the call to lapply, i.e., lapply(sampled_list, function(i) { [your code] }).
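Note that thingy() above returns the cohort-level parameter estimates. If what you need in the final data frame is the customer id plus each customer's forecast, a variation on the same pattern is sketched below: it returns the augmented element itself, so the cust column and the other cbs fields are carried along (same BTYDplus calls as above; function and object names here are illustrative).
forecast_customers <- function(i) {
  draws <- pggg.mcmc.DrawParameters(i, mcmc = 1000, burnin = 250,
                                    thin = 10, chains = 2, trace = 50)
  xstar.draws <- mcmc.DrawFutureTransactions(i, draws)
  i$xstar.pggg   <- apply(xstar.draws, 2, mean)   # expected transactions in the holdout
  i$pactive.pggg <- mcmc.PActive(xstar.draws)
  i$palive.pggg  <- mcmc.PAlive(draws)
  i  # return the customer-level data frame, id included
}
customer.results <- do.call(rbind, lapply(sampled_list, forecast_customers))
head(customer.results[, c("cust", "xstar.pggg", "pactive.pggg", "palive.pggg")])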
In my working project, I use the rfe function from the caret package to do recursive feature elimination. I use a toy example to illustrate my point.
library(mlbench)
library(caret)
data(PimaIndiansDiabetes)
rfFuncs$summary <- twoClassSummary
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
results <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8), rfeControl=control, metric="ROC")
The optimal variables selected are those that give the highest AUROC in the process, and they can be retrieved via results$optVariables.
However, what I want to do is use the '1 standard error rule' to select fewer features (code below). The number of variables identified is 4.
# auc that is 1-se from the highest auc
df.results = results$results %>% dplyr::mutate(ROCSE = ROCSD/sqrt(10-1))
idx = which.max(df.results$ROC)
ROC.1se = df.results$ROC[idx] - df.results$ROCSE[idx]
# plot ROC vs feature size
g = ggplot(df.results, aes(x=Variables, y=ROC)) +
geom_errorbar(aes(ymin=ROC-ROCSE, ymax=ROC+ROCSE),
width=.2, alpha=0.4, linetype=1) +
geom_line() +
geom_point()+
scale_color_brewer(palette="Paired")+
geom_hline(yintercept = ROC.1se)+
labs(x ="Number of Variables", y = "AUROC")
print(g)
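In code, one way to turn that cutoff into a subset size (a sketch using df.results and ROC.1se from above):
# smallest subset size whose mean ROC is still within one SE of the best
n.vars <- min(df.results$Variables[df.results$ROC >= ROC.1se])
n.vars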
The number of variables I identified is 4. Now I need to know which four variables they are. I did the following:
results$variables %>% filter(Variables==4) %>% distinct(var)
It shows me 5 variables!
Does anyone know how I can retrieve those variables? Basically, the same applies to retrieving the selected variables for any chosen subset size.
Thanks a lot in advance!
One-line Answer
If you know you want only the best 4 variables from the rfe resampling, this will give you what you are looking for.
results$optVariables[1:4]
# [1] "glucose" "mass" "age" "pregnant"
dplyr Answer
results$variables %>%
  group_by(var) %>%
  summarize(Overall = mean(Overall)) %>%
  arrange(-Overall)
#
# A tibble: 8 x 2
# var Overall
# <chr> <dbl>
# 1 glucose 34.2
# 2 mass 15.8
# 3 age 12.7
# 4 pregnant 7.92
# 5 pedigree 5.09
# 6 insulin 4.87
# 7 triceps 3.25
# 8 pressure 1.95
Why your attempt gives more than 4 variables
You are filtering 40 observations. 10 folds of the best 4 variables. The best 4 variables is not always the same within each fold. Hence, to get the best top 4 variables across the resamples you need to average their performance across the folds as the code above does. Even simpler, the variables within optVariables are sorted in this order, so you can just grab the first 4 (as in my one-line answer). The proof that this is the case takes a bit of digging into the source code (shown below).
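If you do want to derive the top 4 from results$variables directly, the same averaging logic restricted to the 4-variable subsets might look like this (a sketch, assuming dplyr is loaded and using the column names printed by rfe):
results$variables %>%
  filter(Variables == 4) %>%              # the 10 folds' best-4 selections
  group_by(var) %>%
  summarize(Overall = mean(Overall)) %>%  # average importance across folds
  arrange(-Overall) %>%
  head(4)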
Details: Digging into the source code
A good first thing to do with objects returned from functions like rfe is to try functions like print, summary, or plot. Often custom methods will exist that will give you very helpful information. For example...
# Run rfe with a random seed
# library(dplyr)
# library(mlbench)
# library(caret)
# data(PimaIndiansDiabetes)
# rfFuncs$summary <- twoClassSummary
# control <- rfeControl(functions=rfFuncs, method="cv", number=10)
# set.seed(1)
# results <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8),
# rfeControl=control, metric="ROC")
#
# The next two lines identical...
results
print(results)
# Recursive feature selection
#
# Outer resampling method: Cross-Validated (10 fold)
#
# Resampling performance over subset size:
#
# Variables ROC Sens Spec ROCSD SensSD SpecSD Selected
# 1 0.7250 0.870 0.4071 0.07300 0.07134 0.10322
# 2 0.7842 0.840 0.5677 0.04690 0.04989 0.05177
# 3 0.8004 0.824 0.5789 0.02823 0.04695 0.10456
# 4 0.8139 0.842 0.6269 0.03210 0.03458 0.05727
# 5 0.8164 0.844 0.5969 0.02850 0.02951 0.07288
# 6 0.8263 0.836 0.6078 0.03310 0.03978 0.07959
# 7 0.8314 0.844 0.5966 0.03075 0.04502 0.07232
# 8 0.8316 0.860 0.6081 0.02359 0.04522 0.07316 *
#
# The top 5 variables (out of 8):
# glucose, mass, age, pregnant, pedigree
Hmm, that gives 5 variables, but you said you wanted 4. We can pretty quickly dig into the source code to explore how it is calculating and returning those 5 variables as the top 5 variables.
print(caret:::print.rfe)
#
# Only a snippet code shown below...
# cat("The top ", min(top, x$bestSubset), " variables (out of ",
# x$bestSubset, "):\n ", paste(x$optVariables[1:min(top,
# x$bestSubset)], collapse = ", "), "\n\n", sep = "")
So, basically it is pulling the top 5 variables directly from results$optVariables. How is that getting populated?
# print(caret:::rfe.default)
#
# Snippet 1 of code...
# bestVar <- rfeControl$functions$selectVar(selectedVars,
#                                           bestSubset)
#
# Snippet 2 of code...
# bestSubset = bestSubset, fit = fit, optVariables = bestVar,
Ok, optVariables is getting populated by rfeControl$functions$selectVar.
print(rfeControl)
#
# Snippet of code...
# list(functions = if (is.null(functions)) caretFuncs else functions,
From above, we see that caretFuncs$selectVar is being used...
Details: Source code that is populating optVariables
print(caretFuncs$selectVar)
# function (y, size)
# {
# finalImp <- ddply(y[, c("Overall", "var")], .(var), function(x) mean(x$Overall,
# na.rm = TRUE))
# names(finalImp)[2] <- "Overall"
# finalImp <- finalImp[order(finalImp$Overall, decreasing = TRUE),
# ]
# as.character(finalImp$var[1:size])
# }
I am looking for a simulation/optimisation package in R to identify the ideal weights, enabling me to allocate as many of my data points as possible into the top bucket.
Basically, I want to tune my weights in such a way that I achieve that goal.
Below is the example.
Score1,Score2,Score3,Final,Group
0.87,0.73,0.41,0.63,"60-100"
0.82,0.73,0.85,0.796,"70-80"
0.82,0.37,0.85,0.652,"60-65"
0.58,0.95,0.42,0.664,"60-65"
1,1,0.9,0.96,"90-100"
Weight1,Weight2,Weight3
0.2,0.4,0.4
Final Score = Score1*Weight1 + Score2*Weight2 + Score3*Weight3
The sum of my weights is 1: W1 + W2 + W3 = 1
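For concreteness, the Final column above can be reproduced in R as follows (a small sketch using the mock scores and the example weights):
scores <- data.frame(Score1 = c(0.87, 0.82, 0.82, 0.58, 1.00),
                     Score2 = c(0.73, 0.73, 0.37, 0.95, 1.00),
                     Score3 = c(0.41, 0.85, 0.85, 0.42, 0.90))
w <- c(0.2, 0.4, 0.4)                               # Weight1, Weight2, Weight3
cbind(scores, Final = drop(as.matrix(scores) %*% w))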
I want to tune my weights in such a way that most of my cases fall into the "90-100" bucket. I know there won't be a perfect combination, but I want to capture the maximum number of cases. I am currently trying to do this manually in Excel, using a pivot table, but I want to know if there is any package in R that helps me achieve my goal.
The group allocation ("70-80", "80-90", and so on) is something I have made in Excel using an if-else condition.
R Pivot Result:
"60-100",1
"60-65",2
"70-80",1
"90-100",1
I would appreciate it if someone could help me with this.
Thanks,
Here's an approach that tries to get all the final scores as close as possible to 0.9 using a nested optimisation approach.
Here's your original data:
# Original data
df <- read.table(text = "Score1, Score2, Score3
0.87,0.73,0.41
0.82,0.73,0.85
0.82,0.37,0.85
0.58,0.95,0.42
1,1,0.9", header = TRUE, sep = ",")
This is the cost function for the first weight.
# Outer cost function
cost_outer <- function(w1){
  # Run nested optimisation
  res <- optimise(cost_nested, lower = 0, upper = 1 - w1, w1 = w1)
  # Store the second weight in a global variable
  res_outer <<- res$minimum
  # Return the cost function value
  res$objective
}
This is the cost function for the second weight.
# Nested cost function
cost_nested <- function(w2, w1){
  # Calculate final weight
  w <- c(w1, w2, 1 - w2 - w1)
  # Distance from desired interval
  res <- 0.9 - rowSums(w*df)
  # Zero if negative distance, square distance otherwise
  res <- sum(ifelse(res < 0, 0, res^2))
}
Next, I run the optimisation.
# Repackage weights
weight <- c(optimise(cost_outer, lower = 0, upper = 1)$minimum, res_outer)
weight <- c(weight, 1 - sum(weight))
Finally, I show the results.
# Final scores
cbind(df, Final = rowSums(weight * df))
# Score1 Score2 Score3 Final
# 1 0.87 0.73 0.41 0.7615286
# 2 0.82 0.73 0.85 0.8229626
# 3 0.82 0.37 0.85 0.8267400
# 4 0.58 0.95 0.42 0.8666164
# 5 1.00 1.00 0.90 0.9225343
Notice, however, that this code gets the final scores as close as possible to the interval, which is different from getting the most scores in that interval. That can be achieved by switching out the nested cost function with something like:
# Nested cost function
cost_nested <- function(w2, w1){
# Calculate final weight
w <- c(w1, w2, 1 - w2 -w1)
# Number of instances in desired interval
res <- sum(rowSums(w*df) < 0.9)
}
This can be formulated as a Mixed Integer Programming (MIP) problem. The mathematical model can look like:
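A reconstruction of that formulation, consistent with the description below (n data rows, m scores; a sketch, not the exact original notation):
\begin{aligned}
\max_{w,\;\delta}\quad & \sum_{i=1}^{n} \delta_i \\
\text{s.t.}\quad & F_i = \sum_{j=1}^{m} a_{i,j}\, w_j, \qquad i = 1,\dots,n \\
& 0.9 - M(1-\delta_i) \;\le\; F_i \;\le\; 1 + M(1-\delta_i), \qquad i = 1,\dots,n \\
& \sum_{j=1}^{m} w_j = 1, \qquad w_j \ge 0, \qquad \delta_i \in \{0,1\}
\end{aligned}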
The binary variable δ_i indicates whether the final score F_i is inside the interval [0.9, 1]. M is a "large" value (if all your data is between 0 and 1 we can choose M = 1), and a_{i,j} is your data.
The objective function and all constraints are linear, so we can use standard MIP solvers to solve this problem. MIP solvers for R are readily available.
PS: In the example the groups overlap, which does not make much sense to me. I think if we have "90-100" we should not also have "60-100".
PS2: If all data is between 0 and 1, we can simplify the sandwich constraint a bit: we can drop the right-hand part.
For the small example data set I get:
---- 56 PARAMETER a
j1 j2 j3
i1 0.870 0.730 0.410
i2 0.820 0.730 0.850
i3 0.820 0.370 0.850
i4 0.580 0.950 0.420
i5 1.000 1.000 0.900
---- 56 VARIABLE w.L weights
j1 0.135, j2 0.865
---- 56 VARIABLE f.L final scores
i1 0.749, i2 0.742, i3 0.431, i4 0.900, i5 1.000
---- 56 VARIABLE delta.L selected
i4 1.000, i5 1.000
---- 56 VARIABLE z.L = 2.000 objective
(zeros are not printed)
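If you want to set this up in R, a minimal sketch of the same MIP for the five-row example using the lpSolve package could look like the following (with M = 1 and the right-hand part of the sandwich constraint dropped, as per PS2; variable names are illustrative, and the solver may return a different but equally good optimum):
library(lpSolve)
a <- matrix(c(0.87, 0.73, 0.41,
              0.82, 0.73, 0.85,
              0.82, 0.37, 0.85,
              0.58, 0.95, 0.42,
              1.00, 1.00, 0.90), ncol = 3, byrow = TRUE)
n <- nrow(a); m <- ncol(a); M <- 1
# decision vector: (w_1..w_m, delta_1..delta_n)
f.obj <- c(rep(0, m), rep(1, n))          # maximise the number of selected rows
f.con <- rbind(c(rep(1, m), rep(0, n)),   # weights sum to one
               cbind(a, -M * diag(n)))    # F_i - M*delta_i >= 0.9 - M
f.dir <- c("=", rep(">=", n))
f.rhs <- c(1, rep(0.9 - M, n))
sol <- lp("max", f.obj, f.con, f.dir, f.rhs, binary.vec = (m + 1):(m + n))
sol$solution[1:m]                         # the weights
sum(sol$solution[-(1:m)])                 # rows landing in the 0.9-1 bucket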
When I think I understand something, I like to verify it, so in this case I was trying to verify the calculation of the partial autocorrelation, pacf().
What I end up with is something a little different. My understanding is that the pacf is the coefficient of the regression on the last/furthest lag, given all of the previous lags. To set up some code, I'm using Canadian employment data and the book Elements of Forecasting by F. Diebold (1998), Chapter 6.
#Obtain Canadian Employment dataset
caemp <- c(83.090255, 82.7996338824, 84.6344380294, 85.3774583529, 86.197605, 86.5788438824, 88.0497240294, 87.9249263529, 88.465131, 88.3984638824, 89.4494320294, 90.5563753529, 92.272335, 92.1496788824, 93.9564890294, 94.8114863529, 96.583434, 96.9646728824, 98.9954360294, 101.138164353, 102.882122, 103.095394882, 104.006386029, 104.777404353, 104.701732, 102.563504882, 103.558486029, 102.985774353, 102.098281, 101.471734882, 102.550696029, 104.021564353, 105.093652, 105.194954882, 104.594266029, 105.813184353, 105.149642, 102.899434882, 102.354736029, 102.033974353, 102.014299, 101.835654882, 102.018806029, 102.733834353, 103.134062, 103.263354882, 103.866416029, 105.393274353, 107.081242, 108.414274882, 109.297286029, 111.495994353, 112.680072, 113.061304882, 112.376636029, 111.244054353, 107.305192, 106.678644882, 104.678246029, 105.729204353, 107.837082, 108.022364882, 107.281706029, 107.016934353, 106.045452, 106.370704882, 106.049966029, 105.841184353, 106.045452, 106.650644882, 107.393676029, 108.668584353, 109.628702, 110.261894882, 110.920946029, 110.740154353, 110.048622, 108.190324882, 107.057746029, 108.024724353, 109.712692, 111.409654882, 108.765396029, 106.289084353, 103.917902, 100.799874882, 97.3997700294, 93.2438143529, 94.123068, 96.1970798824, 97.2754290294, 96.4561423529, 92.674237, 92.8536228824, 93.4304540294, 93.2055593529, 93.955896, 94.7296738824, 95.5665510294, 95.5459793529, 97.09503, 97.7573598824, 96.1609430294, 96.5861653529, 103.874812, 105.094384882, 106.804276029, 107.786744353, 106.596022, 107.310354882, 106.897156029, 107.210924353, 107.134682, 108.829774882, 107.926196029, 106.298904353, 103.365872, 102.029554882, 99.3000760294, 95.3045073529, 90.50099, 88.0984848824, 86.5150710294, 85.1143943529, 89.033584, 88.8229008824, 88.2666710294, 87.7260053529, 88.102896, 87.6546968824, 88.4004090294, 88.3618013529, 89.031151, 91.0202948824, 91.6732820294, 92.0149173529)
# create time series with the canadian employment dataset
caemp.ts<-ts(caemp, start=c(1961, 1), end=c(1994, 4), frequency=4)
caemp.ts2<-window(caemp.ts,start=c(1961,5), end=c(1993,4))
# Set up the max lag; the book says use sqrt(T), but in this case I'm using 3 for the example
lag.max <- 3
# R Code using pacf()
pacf(caemp.ts2, lag.max=3, plot=F)
# initialize vector to capture the partial autocorrelations
pauto.corr <- rep(0, lag.max)
# Set up lagged data frame
pa.mat <- as.data.frame(caemp.ts2)
for(i in 1:lag.max){
  a <- c(rep(NA, i), pa.mat[1:(length(caemp.ts2) - i), 1])
  pa.mat <- cbind(pa.mat, a)
}
names(pa.mat) <- c("0":lag.max)
# Set up my base linear model
base.lm <- lm(pa.mat[, 1] ~ 1)
### I could not get the for loop to work successfully here
i <- 1
base.lm <- update(base.lm, .~. + pa.mat[,2])
pauto.corr[i]<-base.lm$coefficients[length(base.lm$coefficients)]
i<-2
base.lm <-update(base.lm, .~. + pa.mat[,3])
pauto.corr[i]<-base.lm$coefficients[length(base.lm$coefficients)]
i<-3
base.lm <-update(base.lm, .~. + pa.mat[,4])
pauto.corr[i]<-base.lm$coefficients[length(base.lm$coefficients)]
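A loop version of those three steps might look like this (a sketch; it refits a fresh lm() with one more lag each time rather than updating the previous fit inside the loop):
pauto.corr.loop <- rep(0, lag.max)
for (i in 1:lag.max) {
  lags <- pa.mat[, 2:(i + 1), drop = FALSE]
  names(lags) <- paste0("lag", 1:i)          # syntactic names for the formula
  fit <- lm(pa.mat[, 1] ~ ., data = lags)    # rows with NA lags are dropped
  pauto.corr.loop[i] <- tail(coef(fit), 1)   # coefficient on the furthest lag
}
round(pauto.corr.loop, 3)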
# Compare results...
round(pauto.corr,3)
pacf(caemp.ts2, lag.max=3, plot=F)
For the output
> round(pauto.corr,3)
[1] 0.971 -0.479 -0.072
> pacf(caemp.ts2, lag.max=3, plot=F)
Partial autocorrelations of series ‘caemp.ts2’, by lag
0.25 0.50 0.75
0.949 -0.244 -0.100
Maybe it is because my example is quarterly and not monthly data, or I could just be wrong?
I'm running k-means clustering on a data frame df1, and I'm looking for a simple approach to computing the closest cluster center for each observation in a new data frame df2 (with the same variable names). Think of df1 as the training set and df2 on the testing set; I want to cluster on the training set and assign each test point to the correct cluster.
I know how to do this with the apply function and a few simple user-defined functions (previous posts on the topic have usually proposed something similar):
df1 <- data.frame(x=runif(100), y=runif(100))
df2 <- data.frame(x=runif(100), y=runif(100))
km <- kmeans(df1, centers=3)
closest.cluster <- function(x) {
  cluster.dist <- apply(km$centers, 1, function(y) sqrt(sum((x-y)^2)))
  return(which.min(cluster.dist)[1])
}
clusters2 <- apply(df2, 1, closest.cluster)
However, I'm preparing this clustering example for a course in which students will be unfamiliar with the apply function, so I would much prefer if I could assign the clusters to df2 with a built-in function. Are there any convenient built-in functions to find the closest cluster?
You could use the flexclust package, which has an implemented predict method for k-means:
library("flexclust")
data("Nclus")
set.seed(1)
dat <- as.data.frame(Nclus)
ind <- sample(nrow(dat), 50)
dat[["train"]] <- TRUE
dat[["train"]][ind] <- FALSE
cl1 = kcca(dat[dat[["train"]]==TRUE, 1:2], k=4, kccaFamily("kmeans"))
cl1
#
# call:
# kcca(x = dat[dat[["train"]] == TRUE, 1:2], k = 4)
#
# cluster sizes:
#
# 1 2 3 4
#130 181 98 91
pred_train <- predict(cl1)
pred_test <- predict(cl1, newdata=dat[dat[["train"]]==FALSE, 1:2])
image(cl1)
points(dat[dat[["train"]]==TRUE, 1:2], col=pred_train, pch=19, cex=0.3)
points(dat[dat[["train"]]==FALSE, 1:2], col=pred_test, pch=22, bg="orange")
There are also conversion methods to convert the results from cluster functions like stats::kmeans or cluster::pam to objects of class kcca and vice versa (here cl is a fitted kmeans object and x is the data it was fitted on):
as.kcca(cl, data=x)
# kcca object of family ‘kmeans’
#
# call:
# as.kcca(object = cl, data = x)
#
# cluster sizes:
#
# 1 2
# 50 50
Something I noticed about both the approach in the question and the flexclust approach is that they are rather slow (benchmarked here for a training and testing set with 1 million observations with 2 features each).
Fitting the original model is reasonably fast:
set.seed(144)
df1 <- data.frame(x=runif(1e6), y=runif(1e6))
df2 <- data.frame(x=runif(1e6), y=runif(1e6))
system.time(km <- kmeans(df1, centers=3))
# user system elapsed
# 1.204 0.077 1.295
The solution I posted in the question is slow at calculating the test-set cluster assignments, since it separately calls closest.cluster for each test-set point:
system.time(pred.test <- apply(df2, 1, closest.cluster))
# user system elapsed
# 42.064 0.251 42.586
Meanwhile, the flexclust package seems to add a lot of overhead regardless of whether we convert the fitted model with as.kcca or fit a new one ourselves with kcca (though the prediction at the end is much faster)
# APPROACH #1: Convert from the kmeans() output
system.time(km.flexclust <- as.kcca(km, data=df1))
# user system elapsed
# 87.562 1.216 89.495
system.time(pred.flexclust <- predict(km.flexclust, newdata=df2))
# user system elapsed
# 0.182 0.065 0.250
# Approach #2: Fit the k-means clustering model in the flexclust package
system.time(km.flexclust2 <- kcca(df1, k=3, kccaFamily("kmeans")))
# user system elapsed
# 125.193 7.182 133.519
system.time(pred.flexclust2 <- predict(km.flexclust2, newdata=df2))
# user system elapsed
# 0.198 0.084 0.302
It seems that there is another sensible approach here: using a fast k-nearest neighbors solution like a k-d tree to find the nearest neighbor of each test-set observation within the set of cluster centroids. This can be written compactly and is relatively speedy:
library(FNN)
system.time(pred.knn <- get.knnx(km$center, df2, 1)$nn.index[,1])
# user system elapsed
# 0.315 0.013 0.345
all(pred.test == pred.knn)
# [1] TRUE
You can use the ClusterR::KMeans_rcpp() function, which uses RcppArmadillo. It allows for multiple initializations (which can be parallelized if OpenMP is available). Besides optimal_init, quantile_init, random and kmeans++ initializations, one can specify the centroids using the CENTROIDS parameter. The running time and convergence of the algorithm can be adjusted using the num_init, max_iters and tol parameters.
library(scorecard)
library(ClusterR)
library(dplyr)
library(ggplot2)
## Generate data
set.seed(2019)
x = c(rnorm(200000, 0,1), rnorm(150000, 5,1), rnorm(150000,-5,1))
y = c(rnorm(200000,-1,1), rnorm(150000, 6,1), rnorm(150000, 6,1))
df <- split_df(data.frame(x,y), ratio = 0.5, seed = 123)
system.time(
kmrcpp <- KMeans_rcpp(df$train, clusters = 3, num_init = 4, max_iters = 100, initializer = 'kmeans++'))
# user system elapsed
# 0.64 0.05 0.82
system.time(pr <- predict_KMeans(df$test, kmrcpp$centroids))
# user system elapsed
# 0.01 0.00 0.02
p1 <- df$train %>% mutate(cluster = as.factor(kmrcpp$clusters)) %>%
ggplot(., aes(x,y,color = cluster)) + geom_point() +
ggtitle("train data")
p2 <- df$test %>% mutate(cluster = as.factor(pr)) %>%
ggplot(., aes(x,y,color = cluster)) + geom_point() +
ggtitle("test data")
gridExtra::grid.arrange(p1,p2,ncol = 2)
I am currently learning R. I have no previous knowledge of STATA.
I want to reanalyze a study which was done in Stata (xtpcse, linear regression with panel-corrected standard errors). I could not find the model or more detailed code in Stata, or any other hint about how to rewrite this in R. I have the plm package for econometrics installed in R. That's as far as I got.
The first lines of the .do file from STATA are copied below (I just saw that it's pretty unreadable. Here is a link to the txt file in which I copied the .do content: http://dl.dropbox.com/u/4004629/This%20was%20in%20the%20.do%20file.txt).
I have no idea how to go about this in a better way. I tried googling for Stata-to-R comparisons and the like, but it did not help.
All data for the study I want to replicate are here:
https://umdrive.memphis.edu/rblanton/public/ISQ_data
---STATA---
Group variable: c_code Number of obs = 265
Time variable: year Number of groups = 27
Panels: correlated (unbalanced) Obs per group: min = 3
Autocorrelation: common AR(1) avg = 9.814815
Sigma computed by pairwise selection max = 14
Estimated covariances = 378 R-squared = 0.8604
Estimated autocorrelations = 1 Wald chi2(11) = 8321.15
Estimated coefficients = 15 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| Panel-corrected
food | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lag_food | .8449038 .062589 13.50 0.000 .7222316 .967576
ciri | -.010843 .0222419 -0.49 0.626 -.0544364 .0327504
human_cap | .0398406 .0142954 2.79 0.005 .0118222 .0678591
worker_rts | -.1132705 .0917999 -1.23 0.217 -.2931951 .066654
polity_4 | .0113995 .014002 0.81 0.416 -.0160439 .0388429
market_size | .0322474 .0696538 0.46 0.643 -.1042716 .1687665
income | .0382918 .0979499 0.39 0.696 -.1536865 .2302701
econ_growth | .0145589 .0105009 1.39 0.166 -.0060224 .0351402
log_trade | -.3062828 .1039597 -2.95 0.003 -.5100401 -.1025256
fix_dollar | -.0351874 .1129316 -0.31 0.755 -.2565293 .1861545
fixed_xr | -.4941214 .2059608 -2.40 0.016 -.897797 -.0904457
xr_fluct | .0019044 .0106668 0.18 0.858 -.0190021 .0228109
lab_growth | .0396278 .0277936 1.43 0.154 -.0148466 .0941022
english | -.1594438 .1963916 -0.81 0.417 -.5443641 .2254766
_cons | .4179213 1.656229 0.25 0.801 -2.828227 3.66407
-------------+----------------------------------------------------------------
rho | .0819359
------------------------------------------------------------------------------
. xtpcse fab_metal lag_fab_metal ciri human_cap worker_rts polity_4 market
> income econ_growth log_trade fix_dollar fixed_xr xr_fluct lab_growth
> english, pairwise corr(ar1)
Update:
I just tried Vincent's code. I tried the pcse2 and vcovBK code, and they both worked (even though I'm not sure what to do with the correlation matrix that comes out of vcovBK).
However, I still have trouble reproducing the estimates of the regression coefficients in the paper I'm reanalyzing. I'm following their recipe as well as I can; the only step I'm missing is, I think, the part where in Stata "Autocorrelation: common AR(1)" is done. The paper I'm analyzing says: "OLS regression using panel corrected standard errors (Beck/Katz '95), control for first order correlation within each panel (corr AR1 option in Stata)."
How do I control for first order correlation within each panel in R?
Here is what I did so far on my data:
## run lm
res.lm <- lm(total_FDI ~ ciri + human_cap + worker_rts + polity_4 + lag_total + market_size + income + econ_growth + log_trade + fixed_xr + fix_dollar + xr_fluct + english + lab_growth, data=D)
## run pcse
res.pcse <- pcse2(res.lm,groupN="c_code",groupT="year",pairwise=TRUE)
As Ramnath mentioned, the pcse package will do what Stata's xtpcse does. Alternatively, you could use the vcovBK() function from the plm package. If you opt for the latter option, make sure you use the cluster='time' option, which is what the Beck & Katz (1995) article suggests and what the Stata command implements.
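For example, a minimal sketch of the vcovBK() route, with the formula and index names taken from the question's data (adjust to your own panel; pmod is an illustrative name):
library(plm)
library(lmtest)
pmod <- plm(total_FDI ~ ciri + human_cap + worker_rts + polity_4 + lag_total +
              market_size + income + econ_growth + log_trade + fixed_xr +
              fix_dollar + xr_fluct + english + lab_growth,
            data = D, index = c("c_code", "year"), model = "pooling")
# panel-corrected standard errors a la Beck & Katz (1995), clustered by time
coeftest(pmod, vcov. = vcovBK(pmod, cluster = "time"))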
The pcse package works well, but there are some issues that make a lot of intuitive user inputs unacceptable, especially if your dataset is unbalanced. You might want to try this re-write of the function that I coded a while ago. Just load the pcse package, load the pcse2 function, and use it by following the instructions in the pcse documentation. IMHO, the function pasted below is cleaner, more flexible and more robust than the one provided by the pcse folks. Simple benchmarks also suggest that my version may be 5 to 10 times faster than theirs, which may matter for big datasets.
Good luck!
library(Matrix)
pcse2 <- function(object, groupN, groupT, pairwise=TRUE){
## Extract basic model info
groupT <- tail(as.character((match.call()$groupT)), 1)
groupN <- tail(as.character((match.call()$groupN)), 1)
dat <- eval(parse(text=object$call$data))
## Sanity checks
if(!"lm" %in% class(object)){stop("Formula object must be of class 'lm'.")}
if(!groupT %in% colnames(dat)){stop(paste(groupT, 'was not found in data', object$call$data))}
if(!groupN %in% colnames(dat)){stop(paste(groupN, 'was not found in data', object$call$data))}
if(anyDuplicated(paste(dat[,groupN], dat[,groupT]))>0){stop(paste('There are duplicate groupN-groupT observations in', object$call$data))}
if(length(dat[is.na(dat[,groupT]),groupT])>0){stop('There are missing time indices in the data.')}
if(length(dat[is.na(dat[,groupN]),groupN])>0){stop('There are missing unit indices in the data.')}
## Expand model frame to include groupT, groupN, resid columns.
f <- as.formula(object$call$formula)
f.expanded <- update.formula(f, paste(". ~ .", groupN, groupT, sep=" + "))
dat.pcse <- model.frame(f.expanded, dat)
dat.pcse$e <- resid(object)
## Extract basic model info (part II)
N <- length(unique(dat.pcse[,groupN]))
T <- length(unique(dat.pcse[,groupT]))
nobs <- nrow(dat.pcse)
is.balanced <- length(resid(object)) == N * T
## If balanced dataset, calculate as in Beck & Katz (1995)
if(is.balanced){
dat.pcse <- dat.pcse[order(dat.pcse[,groupN], dat.pcse[,groupT]),]
X <- model.matrix(f, dat.pcse)
E <- t(matrix(dat.pcse$e, N, T, byrow=TRUE))
Omega <- kronecker((crossprod(E) / T), Matrix(diag(1, T)) )
## If unbalanced and pairwise, calculate as in Franzese (1996)
}else if(pairwise==TRUE){
## Rectangularize
rectangle <- expand.grid(unique(dat.pcse[,groupN]), unique(dat.pcse[,groupT]))
names(rectangle) <- c(groupN, groupT)
rectangle <- merge(rectangle, dat.pcse, all.x=TRUE)
rectangle <- rectangle[order(rectangle[,groupN], rectangle[,groupT]),]
valid <- ifelse(is.na(rectangle$e),0,1)
rectangle[is.na(rectangle)] <- 0
X <- model.matrix(f, rectangle)
X[valid==0,1] <- 0
## Calculate pcse
E <- crossprod(t(matrix(rectangle$e, N, T, byrow=TRUE)))
V <- crossprod(t(matrix(valid, N, T, byrow=TRUE)))
if (length(V[V==0]) > 0){stop("Error! A CS-unit exists without any obs or without any obs in a common period with another CS-unit. You must remove that unit from the data passed to pcse().")}
Omega <- kronecker(E/V, Matrix(diag(1, T)))
## If unbalanced and casewise, calculate based on largest rectangular subset of data
}else{
## Rectangularize
rectangle <- expand.grid(unique(dat.pcse[,groupN]), unique(dat.pcse[,groupT]))
names(rectangle) <- c(groupN, groupT)
rectangle <- merge(rectangle, dat.pcse, all.x=TRUE)
rectangle <- rectangle[order(rectangle[,groupN], rectangle[,groupT]),]
valid <- ifelse(is.na(rectangle$e),0,1)
rectangle[is.na(rectangle)] <- 0
X <- model.matrix(f, rectangle)
X[valid==0,1] <- 0
## Keep only years for which we have the max number of observations
large.panels <- by(dat.pcse, dat.pcse[,groupT], nrow) # How many valid observations per year?
if(max(large.panels) < N){warning('There is no time period during which all units are observed. Consider using pairwise estimation.')}
T.balanced <- names(large.panels[large.panels==max(large.panels)]) # Which years have max(valid observations)?
T.casewise <- length(T.balanced)
dat.balanced <- dat.pcse[dat.pcse[,groupT] %in% T.balanced,] # Extract biggest rectangular subset
dat.balanced <- dat.balanced[order(dat.balanced[,groupN], dat.balanced[,groupT]),]
e <- dat.balanced$e
## Calculate pcse as in Beck & Katz (1995)
E <- t(matrix(dat.balanced$e, N, T.casewise, byrow=TRUE))
Omega <- kronecker((crossprod(E) / T.casewise), Matrix(diag(1, T)))
}
## Finish evaluation, clean and output
salami <- t(X) %*% Omega %*% X
bread <- solve(crossprod(X))
sandwich <- bread %*% salami %*% bread
colnames(sandwich) <- names(coef(object))
row.names(sandwich) <- names(coef(object))
pcse <- sqrt(diag(sandwich))
b <- coef(object)
tstats <- b/pcse
df <- nobs - ncol(X)
pval <- 2*pt(abs(tstats), df, lower.tail=FALSE)
res <- list(vcov=sandwich, pcse=pcse, b=b, tstats=tstats, df=df, pval=pval, pairwise=pairwise,
nobs=nobs, nmiss=(N*T)-nobs, call=match.call())
class(res) <- "pcse"
return(res)
}
Look at the pcse package, which computes panel-corrected standard errors. You certainly have to look at the documentation in Stata to figure out the assumptions made and cross-check them with pcse.
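For instance, a minimal sketch with the pcse package, using the dependent variable and regressors from the Stata output above (D stands for the question's data frame; adjust names to your data):
library(pcse)
fit <- lm(food ~ lag_food + ciri + human_cap + worker_rts + polity_4 +
            market_size + income + econ_growth + log_trade + fix_dollar +
            fixed_xr + xr_fluct + lab_growth + english, data = D)
fit.pcse <- pcse(fit, groupN = D$c_code, groupT = D$year, pairwise = TRUE)
summary(fit.pcse)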