How to calculate this by vector? - r

Updated: Now it's working, but I still don't know how the other way works.
cuts <- seq(from = 3, to = 36, by = 0.01)
for (i in cuts) {
  cut_off <- i
  set.seed(666)
  samp_h <- rnorm(1000, mean = 12, sd = 3)
  samp_d <- rnorm(1000, mean = 18, sd = 6)
  a <- sum(samp_h <= cut_off)
  c <- sum(samp_h > cut_off)
  b <- sum(samp_d <= cut_off)
  d <- sum(samp_d > cut_off)
  sens <- a / (a + c)
  spci <- d / (d + b)
  assign(paste("ss", as.character(cut_off), sep = ""), sens)
  assign(paste("sp", as.character(cut_off), sep = ""), spci)
}
ss_v <- unlist(lapply(paste0("ss", cuts), get))
sp_v <- unlist(lapply(paste0("sp", cuts), get))
plot(1 - sp_v, ss_v)
Hi all:
Hi all:
I was trying to use different 'cut_off' values to get different 'sens' (sensitivity) and 'spci' (specificity) values. The problem with the code above is that with 34 'cuts' I can get the result, but if I change the cuts to:
cuts <- seq(from=3, to=36, by=0.01)
this method can't return the results. The problem is that I calculate the numbers one cutoff at a time, so I am asking how to use vectors to calculate "ss_v" and "sp_v" directly. Thank you very much.
Background information:
Suppose that in ‘healthy’ patients antibody levels are distributed Normal(12, 3²) and in ‘diseased’ patients antibodies are distributed Normal(18, 6²). Note that these are ‘made up’ numbers and not intended to be realistic.
Simulate antibody counts for a large number of diseased and healthy patients (e.g. 1000 of each) – using the ‘rnorm’ function in R. What would the sensitivity and specificity be if a cutoff of 15 was chosen?
Record the sensitivity and specificity for a range of cutoffs between 3 and 36 (e.g. 3, 3.01, 3.02, …, 35.98, 35.99, 36). Hint: generate the cutoffs using the ‘seq’ function in R and then calculate the sensitivity and specificity using a ‘for’ loop or a vectorised calculation.
Produce a plot with ‘1-Specificity’ in the x-axis and ‘Sensitivity’ in the y-axis.
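For the cutoff-of-15 question above, a minimal sketch (using the same simulated samples as the code at the top of this post, and taking sensitivity as the proportion of diseased patients above the cutoff and specificity as the proportion of healthy patients at or below it) could be:
set.seed(666)
samp_h <- rnorm(1000, mean = 12, sd = 3)   # healthy: Normal(12, 3^2)
samp_d <- rnorm(1000, mean = 18, sd = 6)   # diseased: Normal(18, 6^2)
sens_15 <- mean(samp_d > 15)    # sensitivity at a cutoff of 15
spec_15 <- mean(samp_h <= 15)   # specificity at a cutoff of 15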

Your code is an example of trying to use R as a macro language. Better would be learning how to use R vectors properly. Since you used a for-loop, you should pre-allocate sens and spci and assign into them as indexed vectors. (So I am endorsing your request for a vector of results as the sensible way to go.) Then give the vectors names, rather than littering your workspace with a profusion of individual, disconnected named objects. Try this instead:
cuts <- seq(from = 3, to = 36, by = 1)
sens <- numeric(length(cuts)); spci <- numeric(length(cuts))
for (i in cuts) {
  cut_off <- i
  set.seed(666)
  samp_h <- rnorm(1000, mean = 12, sd = 3)
  samp_d <- rnorm(1000, mean = 18, sd = 6)
  hth <- table(samp_h)
  dis <- table(samp_d)
  a <- length(hth[names(hth) <= cut_off])
  c <- length(hth[names(hth) > cut_off])
  b <- length(dis[names(dis) <= cut_off])
  d <- length(dis[names(dis) > cut_off])
  sens[i] <- a / (a + c)
  spci[1] <- d / (d + b)
}
names(sens) <- paste0("ss",cuts)
names(spci) <- paste0("sp",cuts)
The notion of working on a new simulated dataset with every loop iteration doesn't really impress me with its efficiency, although it might make sense if you were simulating something different each time. I'm also not sure you have constructed sens and spci as sensitivity and specificity correctly, but at least you can now see what the results look like. There are several packages that will construct ROC curves for you.
This is the reason I'm doubtful that your algorithm inside the loop is correct:
> sens
ss3 ss4 ss5 ss6 ss7 ss8 ss9 ss10 ss11 ss12 ss13
0.000 0.000 0.745 0.747 0.752 0.764 0.792 0.836 0.895 0.000 0.123
ss14 ss15 ss16 ss17 ss18 ss19 ss20 ss21 ss22 ss23 ss24
0.239 0.374 0.485 0.593 0.661 0.700 0.721 0.736 0.744 0.745 0.745
ss25 ss26 ss27 ss28 ss29 ss30 ss31 ss32 ss33 ss34 ss35
0.745 0.745 0.745 0.745 0.745 0.745 0.745 0.747 0.747 0.747 0.747
ss36 <NA> <NA>
0.747 0.747 0.747
It just doesn't look like a sensitivity result that I would expect. I might have used code like abcd <- table(samp_h >= cut_off, samp_d >= cut_off) to generate the values you have for a, b, c, d. You can then use matrix indexing with that table result. Another option might be to skip your table efforts and use this code block:
a <- sum(samp_h <= cut_off)
c <- sum(samp_h > cut_off)
b <- sum(samp_d <= cut_off)
d <- sum(samp_d > cut_off)
The sensitivity results now look more sensible, but not so the spci results (because I had the indexing wrong; that is now fixed in the code below).
cuts <- seq(from = 3, to = 36, by = 1)
sens <- numeric(length(cuts)); spci <- numeric(length(cuts))
set.seed(666)
samp_h <- rnorm(1000, mean = 12, sd = 3)
samp_d <- rnorm(1000, mean = 18, sd = 6)
# Only need to make the test data.frame once
dfrm <- data.frame(vals = c(samp_h, samp_d),
                   grp  = c(rep("H", 1000), rep("D", 1000)))
for (i in seq_along(cuts)) {
  cut_off <- cuts[i]   # use the cutoff value, not the loop index
  abcd <- with(dfrm,
               table(Test_res = vals > cut_off,
                     status   = grp))
  sens[i] <- abcd["TRUE", "D"] / sum(abcd[, "D"])
  spci[i] <- abcd["FALSE", "H"] / sum(abcd[, "H"])
}
names(sens) <- paste0("ss", cuts)
names(spci) <- paste0("sp", cuts)
plot(1 - spci, sens, type = "b")
text(1 - spci[c(TRUE, FALSE, FALSE, FALSE, FALSE)] + .05,
     sens[c(TRUE, FALSE, FALSE, FALSE, FALSE)],
     labels = (3:36)[c(TRUE, FALSE, FALSE, FALSE, FALSE)])   # hack to print every 5th cutoff value
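To answer the original "how do I do this with vectors directly" question: since sensitivity and specificity at a given cutoff are just proportions, the loop can be dropped entirely. A minimal sketch, using the same definitions as the code above (sensitivity = proportion of diseased above the cutoff, specificity = proportion of healthy at or below it):
set.seed(666)
samp_h <- rnorm(1000, mean = 12, sd = 3)
samp_d <- rnorm(1000, mean = 18, sd = 6)
cuts <- seq(from = 3, to = 36, by = 0.01)
ss_v <- sapply(cuts, function(k) mean(samp_d > k))    # sensitivity at each cutoff
sp_v <- sapply(cuts, function(k) mean(samp_h <= k))   # specificity at each cutoff
plot(1 - sp_v, ss_v, type = "l", xlab = "1 - Specificity", ylab = "Sensitivity")
The same thing can be written without sapply via the empirical CDF: ss_v <- 1 - ecdf(samp_d)(cuts) and sp_v <- ecdf(samp_h)(cuts).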

Related

How to pass list elements to model in R?

I am quite new to the use of lists, so I apologize if this problem sounds very basic.
From an original set of 459,046 customers, I have created a function that splits the base and stores it in several elements of a list.
sampled_list <- baseSample(dataset = clv_df_cbs, sample.size = 10000, seed = 12345)
Executing this function (baseSample) returns a new list object containing mutually exclusive groups of customers (each group is made of 10,000 customers, apart from the last one, which may be smaller depending on the initial volume).
> sampled_list <- baseSample(dataset = clv_df_cbs, sample.size = 10000, seed = 12345)
[1] "Seed: 12345"
[1] "Total groups created: 46"
[1] "Group size: 10000"
In this case, the output is a list of 46 elements stored in the object called sampled_list.
Now, I want to pass each of these 46 elements to a BTYD model that will forecast the number of transactions in the next 90 days (given the learnings from the input).
The reason I cannot pass the full dataset to the BTYD model is that the model relies heavily on MCMC, so the computation takes too long and the model never gets to produce any output. I have therefore decided to generate forecasts by running the same model several times (on samples that are big enough) until the whole base has been passed through as model input.
The operations that need to be performed on each of the elements are the following:
# Estimate parameters for element1 of the list
pggg.draws1 <- pggg.mcmc.DrawParameters(element1,
                                        mcmc = 1000,  # number of MCMC steps
                                        burnin = 250, # number of initial MCMC steps which are discarded
                                        thin = 10,    # only every thin-th MCMC step will be returned
                                        chains = 2,   # number of MCMC chains to be run
                                        trace = 50)   # print logging step every trace iteration
# generate draws for holdout period
pggg.xstar.draws1 <- mcmc.DrawFutureTransactions(element1, pggg.draws1)
# conditional expectations
element1$xstar.pggg <- apply(pggg.xstar.draws1, 2, mean)
# P(active)
element1$pactive.pggg <- mcmc.PActive(pggg.xstar.draws1)
# P(alive)
element1$palive.pggg <- mcmc.PAlive(pggg.draws1)
# show estimates for first few customers
head(element1[, c("x", "t.x", "x.star",
                  "xstar.pggg", "pactive.pggg", "palive.pggg")], 50)
# report median cohort-level parameter estimates
round(apply(as.matrix(pggg.draws1$level_2), 2, median), 3)
# report mean over median individual-level parameter estimates
median.est1 <- sapply(pggg.draws1$level_1, function(draw) {
  apply(as.matrix(draw), 2, median)
})
round(apply(median.est1, 1, mean), 3)
Ideally, the output should be stored straight into a new data.frame, so I can retrieve the Id and the forecast (amongst other things originally included in the dataset).
Below is some mock data to play with, taken from a publicly available dataset.
library(BTYDplus)
library(tidyverse)
data("groceryElog")
dataset<-elog2cbs(groceryElog, T.cal = "2006-12-01")
# FUNCTION baseSample ####
baseSample <- function(dataset, sample.size, seed = NULL) {
  seed.value <- if (is.null(seed)) {
    as.numeric(format(Sys.Date(), "%Y")) * 10000 +
      as.numeric(format(Sys.Date(), "%m")) * 100 +
      as.numeric(format(Sys.Date(), "%d"))
  } else {
    seed
  }
  set.seed(seed.value)
  # RE-ORDER DATA FRAME (SAME LENGTH)
  data <- with(dataset, dataset[order(sample(cust, nrow(dataset))), ])
  # BUILD A LIST OF DFs
  set.sample.size <- sample.size
  data$cycles_group <- paste0("sample_", ceiling(1:nrow(data) / set.sample.size))
  df_list <- split(data, data$cycles_group)
  print(paste0("Seed: ", seed.value))
  print(paste0("Total groups created: ", length(unique(data$cycles_group))))
  print(paste0("Group size: ", set.sample.size))
  return(df_list)
  #print(df_list)
}
# ** OUTPUT: Base split in lists ####
sampled_list <- baseSample(dataset = dataset, sample.size = 100, seed = 12345)
Thanks
In base R, you can use lapply to iterate a function over the elements of a list and return a new list with the results of those iterations. After using your example code to generate a list called sampled_list...
# turn the code for the operations you want to perform on each list element into a function,
# with a couple of minor tweaks
thingy <- function(i) {
  # Estimate parameters for element1 of the list
  pggg.draws1 <- pggg.mcmc.DrawParameters(i,
                                          mcmc = 1000,  # number of MCMC steps
                                          burnin = 250, # number of initial MCMC steps which are discarded
                                          thin = 10,    # only every thin-th MCMC step will be returned
                                          chains = 2,   # number of MCMC chains to be run
                                          trace = 50)   # print logging step every trace iteration
  # generate draws for holdout period
  pggg.xstar.draws1 <- mcmc.DrawFutureTransactions(i, pggg.draws1)
  # conditional expectations
  i$xstar.pggg <- apply(pggg.xstar.draws1, 2, mean)
  # P(active)
  i$pactive.pggg <- mcmc.PActive(pggg.xstar.draws1)
  # P(alive)
  i$palive.pggg <- mcmc.PAlive(pggg.draws1)
  # show estimates for first few customers [commenting out for this iterated version]
  # head(element1[, c("x", "t.x", "x.star", "xstar.pggg", "pactive.pggg", "palive.pggg")], 50)
  # report median cohort-level parameter estimates
  round(apply(as.matrix(pggg.draws1$level_2), 2, median), 3)
  # report mean over median individual-level parameter estimates
  median.est1 <- sapply(pggg.draws1$level_1, function(draw) {
    apply(as.matrix(draw), 2, median)
  })
  # get the bits you want in a named vector
  z <- round(apply(median.est1, 1, mean), 3)
  # convert that named vector of results into a one-row data frame to make collapsing easier
  data.frame(as.list(z))
}
# now use lapply to iterate that function over the elements of your list
results <- lapply(sampled_list, thingy)
# now bind the results into a data frame
boundresults <- do.call(rbind, results)
Results (which took a while to get):
k lambda mu tau z
sample_1 4.200 0.174 0.091 102.835 0.27
sample_10 3.117 0.149 0.214 128.143 0.29
sample_11 4.093 0.154 0.115 130.802 0.30
sample_12 4.191 0.142 0.053 114.108 0.33
sample_13 2.605 0.155 0.071 160.743 0.35
sample_14 9.196 0.210 0.084 111.747 0.36
sample_15 2.005 0.145 0.091 298.872 0.40
sample_16 2.454 0.111 0.019 78731750.121 0.70
sample_2 2.808 0.138 0.059 812.278 0.40
sample_3 4.327 0.166 0.116 559.318 0.42
sample_4 9.266 0.166 0.038 146.283 0.40
sample_5 3.277 0.157 0.073 105.915 0.33
sample_6 9.584 0.184 0.086 118.299 0.31
sample_7 4.244 0.189 0.118 54.945 0.23
sample_8 4.388 0.147 0.085 325.054 0.36
sample_9 7.898 0.181 0.052 83.892 0.33
You can also combine those last two steps into a single line of do.call(rbind, lapply(...)). If you want to make the row names in the results table into a column, you could do boundresults$sample <- row.names(boundresults) after making that table. And if you don't like creating new objects in your environment, you could put that function inside the call to lapply, i.e., lapply(sampled_list, function(i) { [your code] }).
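Putting those last two suggestions together, a minimal sketch (assuming sampled_list and thingy are defined as above):
boundresults <- do.call(rbind, lapply(sampled_list, thingy))
boundresults$sample <- row.names(boundresults)   # keep the sample id as a regular column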

Simulation/Optimization Package in R for tuning weights to achieve maximum allocation for groups

I am looking for a simulation/optimization package in R to identify the ideal weights, so that as many of my data points as possible are allocated to the highest bucket.
Basically, I want to tune my weights in such a way that I achieve that goal.
Below is the example.
Score1,Score2,Score3,Final,Group
0.87,0.73,0.41,0.63,"60-100"
0.82,0.73,0.85,0.796,"70-80"
0.82,0.37,0.85,0.652,"60-65"
0.58,0.95,0.42,0.664,"60-65"
1,1,0.9,0.96,"90-100"
Weight1,Weight2,Weight3
0.2,0.4,0.4
Final Score = Score1*Weight1 + Score2*Weight2 + Score3*Weight3
The sum of my weights is 1: W1 + W2 + W3 = 1
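For reference, the Final column above is just this weighted sum; a quick check in R (score values copied from the example):
scores  <- matrix(c(0.87, 0.73, 0.41,
                    0.82, 0.73, 0.85,
                    0.82, 0.37, 0.85,
                    0.58, 0.95, 0.42,
                    1.00, 1.00, 0.90), ncol = 3, byrow = TRUE)
weights <- c(0.2, 0.4, 0.4)
as.vector(scores %*% weights)   # 0.630 0.796 0.652 0.664 0.960, matching the Final column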
I want to tune my weights in such a way that most of my cases fall into the "90-100" bucket. I know there won't be a perfect combination, but I want to capture the maximum number of cases. I am currently trying to do the same manually in Excel, using a pivot table, but I want to know if there is any package in R that helps me achieve my goal.
The group allocations ("70-80", "80-90", etc.) are something I have made in Excel using if-else conditions.
R Pivot Result:
"60-100",1
"60-65",2
"70-80",1
"90-100",1
I would appreciate it if someone could help me with this.
Thanks,
Here's an approach that tries to get all the final scores as close as possible to 0.9, using nested optimisation.
Here's your original data:
# Original data
df <- read.table(text = "Score1, Score2, Score3
0.87,0.73,0.41
0.82,0.73,0.85
0.82,0.37,0.85
0.58,0.95,0.42
1,1,0.9", header = TRUE, sep = ",")
This is the cost function for the first weight.
# Outer cost function
cost_outer <- function(w1) {
  # Run nested optimisation
  res <- optimise(cost_nested, lower = 0, upper = 1 - w1, w1 = w1)
  # Spit second weight into a global variable
  res_outer <<- res$minimum
  # Return the cost function value
  res$objective
}
This is the cost function for the second weight.
# Nested cost function
cost_nested <- function(w2, w1) {
  # Calculate final weight
  w <- c(w1, w2, 1 - w2 - w1)
  # Distance from desired interval
  res <- 0.9 - rowSums(w * df)
  # Zero if negative distance, square distance otherwise
  res <- sum(ifelse(res < 0, 0, res^2))
}
Next, I run the optimisation.
# Repackage weights
weight <- c(optimise(cost_outer, lower = 0, upper = 1)$minimum, res_outer)
weight <- c(weight, 1 - sum(weight))
Finally, I show the results.
# Final scores
cbind(df, Final = rowSums(weight * df))
# Score1 Score2 Score3 Final
# 1 0.87 0.73 0.41 0.7615286
# 2 0.82 0.73 0.85 0.8229626
# 3 0.82 0.37 0.85 0.8267400
# 4 0.58 0.95 0.42 0.8666164
# 5 1.00 1.00 0.90 0.9225343
Notice, however, that this code gets the final scores as close as possible to the interval, which is different from getting the most scores in that interval. That can be achieved by switching out the nested cost function with something like:
# Nested cost function
cost_nested <- function(w2, w1) {
  # Calculate final weight
  w <- c(w1, w2, 1 - w2 - w1)
  # Number of instances in desired interval
  res <- sum(rowSums(w * df) < 0.9)
}
This can be formulated as a Mixed Integer Programming (MIP) problem. The mathematical model can look like:
The binary variable δi indicates whether the final score Fi is inside the interval [0.9, 1]. M is a "large" value (if all your data is between 0 and 1 we can choose M = 1), and a(i,j) is your data.
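Written out, a model consistent with the variables just described (a sketch using the usual big-M linearisation) could look like:
maximize    sum(i, δi)
subject to  Fi = sum(j, a(i,j)*w(j))                         for all i
            sum(j, w(j)) = 1,   w(j) >= 0
            0.9 - M*(1 - δi)  <=  Fi  <=  1 + M*(1 - δi)     for all i   (the "sandwich")
            δi in {0, 1}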
The objective function and all constraints are linear, so we can use standard MIP solvers to solve this problem. MIP solvers for R are readily available.
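As an illustration, here is one way that model could be set up in R with the lpSolve package (a sketch, not the original answer's code; glpk or ompr would work just as well). The decision variables are w1..w3 followed by δ1..δ5, M = 1, df is the score data frame built in the earlier answer, and the right part of the "sandwich" is dropped because all scores are in [0, 1] (see PS2 below).
library(lpSolve)
a <- as.matrix(df)                          # 5 x 3 score matrix
n <- nrow(a); m <- ncol(a)
obj <- c(rep(0, m), rep(1, n))              # maximise the number of delta_i equal to 1
const.mat <- rbind(c(rep(1, m), rep(0, n)), # sum of the weights equals 1
                   cbind(a, -diag(n)))      # a_i . w - M*delta_i >= 0.9 - M
const.dir <- c("=", rep(">=", n))
const.rhs <- c(1, rep(0.9 - 1, n))
sol <- lp("max", obj, const.mat, const.dir, const.rhs,
          binary.vec = (m + 1):(m + n))
sol$solution[1:m]                           # the tuned weights
sum(sol$solution[(m + 1):(m + n)])          # how many rows land in the 90-100 bucket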
PS: in the example the groups overlap, which does not make much sense to me. I think if we have "90-100" we should not also have "60-100".
PS2: if all data is between 0 and 1, we can simplify the "sandwich" constraint a bit: we can drop the right part.
For the small example data set I get:
---- 56 PARAMETER a
j1 j2 j3
i1 0.870 0.730 0.410
i2 0.820 0.730 0.850
i3 0.820 0.370 0.850
i4 0.580 0.950 0.420
i5 1.000 1.000 0.900
---- 56 VARIABLE w.L weights
j1 0.135, j2 0.865
---- 56 VARIABLE f.L final scores
i1 0.749, i2 0.742, i3 0.431, i4 0.900, i5 1.000
---- 56 VARIABLE delta.L selected
i4 1.000, i5 1.000
---- 56 VARIABLE z.L = 2.000 objective
(zeros are not printed)

trying to perform a t.test for each row and count all rows where p-value is less than 0.05

I've been racking my brain for the past four hours trying to find the solution to an R problem, which is driving me nuts. I've been searching everywhere for a decent answer, but so far I've been hitting wall after wall. I am now appealing to the good will of this fine community for help.
Consider the following dataset:
set.seed(2112)
DataSample <- matrix(rnorm(24000),nrow=1000)
colnames(DataSample) <- c(paste("Trial",1:12,sep=""),paste("Control",13:24,sep=""))
I need to perform a t-test for every row in DataSample in order to find out if groups TRIAL and CONTROL differ (equal variance applies).
Then I need to count the number of rows with a p-value equal to, or lower than 0.05.
So here is the code I tried, which I know is wrong:
set.seed(2112)
DataSample <- matrix(rnorm(24000),nrow=1000)
colnames(DataSample) <- c(paste("Trial",1:12,sep=""),paste("Control",13:24,sep=""))
pValResults <- apply(
  DataSample[, 1:12], 1, function(x) t.test(x, DataSample[, 13:24], var.equal = T)$p.value
)
sum(pValResults < 0.05) # Returns the wrong answer (so I was told)
I did try looking at many similar questions around Stack Overflow, but I would often end up with syntax errors or a dimension mismatch. The code above is the best I could get without triggering an R error -- but since the code returns the wrong answer, I have nothing to feel proud of.
Any advice will be greatly appreciated! Thanks in advance for your time.
One option is to loop over the data set calculating the t test for each row, but it is not as elegant.
set.seed(2112)
DataSample <- matrix(rnorm(24000), nrow = 1000)
colnames(DataSample) <- c(paste("Trial", 1:12, sep = ""), paste("Control", 13:24, sep = ""))
# initialize vector of stored p-values
pvalue <- rep(0, nrow(DataSample))
for (i in 1:nrow(DataSample)) {
  pvalue[i] <- t.test(DataSample[i, 1:12], DataSample[i, 13:24])$p.value
}
# finding number that are significant
sum(pvalue < 0.05)
I converted to a data.table, and the answer I got was 45:
library(data.table)
DataSample.dt <- as.data.table(DataSample)
sum(sapply(seq_len(nrow(DataSample.dt)), function(x)
  t.test(DataSample.dt[x, paste0('Trial', 1:12), with = F],
         DataSample.dt[x, paste0('Control', 13:24), with = F],
         var.equal = T)$p.value) < 0.05)
To do a paired t test, you need to supply the paired = TRUE parameter. The t.test function isn't vectorised, but it's quite simple to do t tests on a whole matrix at a time. Here are three methods (including using apply):
library("genefilter")
library("matrixStats")
library("microbenchmark")
dd <- DataSample[, 1:12] - DataSample[, 13:24]
microbenchmark::microbenchmark(
  manual = {ps1 <- 2 * pt(-abs(rowMeans(dd) / sqrt(rowVars(dd) / ncol(dd))), ncol(dd) - 1)},
  apply = {ps2 <- apply(DataSample, 1, function(x) t.test(x[1:12], x[13:24], paired = TRUE)$p.value)},
  rowttests = {ps3 <- rowttests(dd)[, "p.value"]})
# Unit: milliseconds
#       expr        min         lq       mean     median         uq        max neval
#     manual   1.611808   1.641783   1.677010   1.663122   1.709401   1.852347   100
#      apply 390.869635 398.720930 404.391487 401.508382 405.715668 634.932675   100
#  rowttests   2.368823   2.417837   2.639671   2.574320   2.757870   7.207135   100
You can see the manual method is over 200x faster than apply.
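For reference, the "manual" expression above is just the textbook paired t test written with row-wise means and variances: for each row, take the 12 differences d, compute t = mean(d) / sqrt(var(d) / 12), and then the two-sided p value 2 * P(T_11 <= -|t|). That is exactly what 2 * pt(-abs(rowMeans(dd) / sqrt(rowVars(dd) / ncol(dd))), ncol(dd) - 1) evaluates, just for all 1000 rows at once.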
If you actually meant an unpaired test, here's the equivalent comparison:
microbenchmark::microbenchmark(
  manual = {x <- DataSample[, 1:12]; y <- DataSample[, 13:24]; ps1 <- 2 * pt(-abs((rowMeans(x) - rowMeans(y)) / sqrt((rowVars(x) + rowVars(y)) / ncol(x))), ncol(DataSample) - 2)},
  apply = {ps2 <- apply(DataSample, 1, function(x) t.test(x[1:12], x[13:24], var.equal = TRUE)$p.value)},
  rowttests = {ps3 <- rowttests(DataSample, factor(rep(1:2, each = 12)))[, "p.value"]})
Note the manual method assumes that the two groups are the same sizes.
Adding an alternative using an external library.
Performing the test:
library(matrixTests)
res <- row_t_equalvar(DataSample[,1:12], DataSample[,13:24])
Format of the result:
res
obs.x obs.y obs.tot mean.x mean.y mean.diff var.x var.y var.pooled stderr df statistic pvalue conf.low conf.high alternative mean.null conf.level
1 12 12 24 0.30569721 0.160622830 0.145074376 0.5034806 1.0769678 0.7902242 0.3629105 22 0.399752487 0.69319351 -0.6075559 0.89770469 two.sided 0 0.95
2 12 12 24 -0.27463354 -0.206396781 -0.068236762 0.8133311 0.2807800 0.5470556 0.3019535 22 -0.225984324 0.82329990 -0.6944500 0.55797651 two.sided 0 0.95
3 12 12 24 -0.19805092 -0.023207888 -0.174843032 0.4278359 0.5604078 0.4941219 0.2869733 22 -0.609265949 0.54858909 -0.7699891 0.42030307 two.sided 0 0.95
Number of rows with p <= 0.05:
> sum(res$pvalue <= 0.05)
[1] 4

How to separate data frame in R into two separate data frames for Stepwise Regression in SuperLearner

I have a 162 x 152 dataset. What I want to do is use stepwise regression, incorporating cross-validation, on the dataset to create a model and to test how accurate that model is.
ID RT (seconds) 76_TI2 114_DECC 120_Lop 212_PCD 236_X3Av
4281 38 4.086 1.2 2.322 0 0.195
4952 40 2.732 0.815 1.837 1.113 0.13
4823 41 4.049 1.153 2.117 2.354 0.094
3840 41 4.049 1.153 2.117 3.838 0.117
3665 42 4.56 1.224 2.128 2.38 0.246
3591 42 2.96 0.909 1.686 0.972 0.138
This is part of the dataset I have. I want to construct a model where my Y variable is RT (seconds) and my predictors are all the other 151 variables in my dataset. I was told to use the SuperLearner package, and the algorithm for that is:
test <- CV.SuperLearner(Y = Y, X = X, V = 10, SL.library = SL.library,
verbose = TRUE, method = "method.NNLS")
The problem is that I'm still rather new to R. The main way in which I've been reading my data in and applying other machine-learning algorithms to it is the following:
mydata <- read.csv("filepathway")
fit <- lm(RT..seconds~., data=mydata)
So how do I go about separating the RT (seconds) column from the rest of my data so that I can pass them in as X and Y data frames? I.e. something along the lines of:
mydata <- read.csv("filepathway")
mydata$RT..seconds. = Y #separating my Y response variable
Alltheother151variables = X #separating all of my X predictor variables (all 151 of them)
SL.library <- c("SL.step")
test <- CV.SuperLearner(Y (i.e RT seconds column), X (all the other 151 variables that corresponds to the RT values), V = 10, SL.library = SL.library,
verbose = TRUE, method = "method.NNLS")
I hope this all makes sense. Thanks!
If the response variable is in the first column, you can simply use:
Y <- mydata[ , 1 ]
X <- mydata[ , -1 ]
The first argument of [ (the row number) is empty, so we keep all the rows, and the second is either 1 (the first column) or -1 (everything but the first column).
If your response variable is elsewhere, you can use the column names instead:
Y <- mydata[ , "RT..seconds." ]
X <- mydata[ , setdiff( colnames(mydata), "RT..seconds." ) ]
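With Y and X separated like that, the CV.SuperLearner call from the question should slot in directly; a sketch (the "RT..seconds." column name is taken from your lm call, so double-check it matches your read.csv output):
library(SuperLearner)
Y <- mydata[ , "RT..seconds." ]
X <- mydata[ , setdiff( colnames(mydata), "RT..seconds." ) ]
SL.library <- c("SL.step")
test <- CV.SuperLearner(Y = Y, X = X, V = 10, SL.library = SL.library,
                        verbose = TRUE, method = "method.NNLS")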

xtpcse from Stata - how to rewrite in R

I am currently learning R. I have no previous knowledge of Stata.
I want to reanalyze a study which was done in Stata (xtpcse, linear regression with panel-corrected standard errors). I could not find the model or more detailed code in Stata, or any other hint on how to rewrite this in R. I have the plm package for econometrics installed for R. That's as far as I got.
The first lines of the .do file from Stata are copied below (I just saw that it's pretty unreadable; here is a link to the txt file into which I copied the .do content: http://dl.dropbox.com/u/4004629/This%20was%20in%20the%20.do%20file.txt).
I have no idea how to go about this in a better way. I tried googling for Stata and R comparisons and the like, but it did not work.
All data for the study I want to replicate are here:
https://umdrive.memphis.edu/rblanton/public/ISQ_data
---STATA---
Group variable: c_code Number of obs = 265
Time variable: year Number of groups = 27
Panels: correlated (unbalanced) Obs per group: min = 3
Autocorrelation: common AR(1) avg = 9.814815
Sigma computed by pairwise selection max = 14
Estimated covariances = 378 R-squared = 0.8604
Estimated autocorrelations = 1 Wald chi2(11) = 8321.15
Estimated coefficients = 15 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| Panel-corrected
food | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lag_food | .8449038 .062589 13.50 0.000 .7222316 .967576
ciri | -.010843 .0222419 -0.49 0.626 -.0544364 .0327504
human_cap | .0398406 .0142954 2.79 0.005 .0118222 .0678591
worker_rts | -.1132705 .0917999 -1.23 0.217 -.2931951 .066654
polity_4 | .0113995 .014002 0.81 0.416 -.0160439 .0388429
market_size | .0322474 .0696538 0.46 0.643 -.1042716 .1687665
income | .0382918 .0979499 0.39 0.696 -.1536865 .2302701
econ_growth | .0145589 .0105009 1.39 0.166 -.0060224 .0351402
log_trade | -.3062828 .1039597 -2.95 0.003 -.5100401 -.1025256
fix_dollar | -.0351874 .1129316 -0.31 0.755 -.2565293 .1861545
fixed_xr | -.4941214 .2059608 -2.40 0.016 -.897797 -.0904457
xr_fluct | .0019044 .0106668 0.18 0.858 -.0190021 .0228109
lab_growth | .0396278 .0277936 1.43 0.154 -.0148466 .0941022
english | -.1594438 .1963916 -0.81 0.417 -.5443641 .2254766
_cons | .4179213 1.656229 0.25 0.801 -2.828227 3.66407
-------------+----------------------------------------------------------------
rho | .0819359
------------------------------------------------------------------------------
. xtpcse fab_metal lag_fab_metal ciri human_cap worker_rts polity_4 market
> income econ_growth log_trade fix_dollar fixed_xr xr_fluct lab_growth
> english, pairwise corr(ar1)
Update:
I just tried Vincent's code. I tried the pcse2 and vcovBK code, and they both worked (even though I'm not sure what to do with the correlation matrix that comes out of vcovBK).
However, I still have trouble reproducing the estimates of the regression coefficients in the paper I'm reanalyzing. I'm following their recipe as well as I can; the only step I'm missing is, I think, the part where in Stata "Autocorrelation: common AR(1)" is done. The paper I'm analyzing says: "OLS regression using panel corrected standard errors (Beck/Katz '95), control for first order correlation within each panel (corr AR1 option in Stata)."
How do I control for first order correlation within each panel in R?
Here is what I did so far on my data:
## run lm
res.lm <- lm(total_FDI ~ ciri + human_cap + worker_rts + polity_4 + lag_total + market_size + income + econ_growth + log_trade + fixed_xr + fix_dollar + xr_fluct + english + lab_growth, data=D)
## run pcse
res.pcse <- pcse2(res.lm,groupN="c_code",groupT="year",pairwise=TRUE)
As Ramnath mentioned, the pcse package will do what Stata's xtpcse does. Alternatively, you could use the vcovBK() function from the plm package. If you opt for the latter option, make sure you use the cluster='time' option, which is what the Beck & Katz (1995) article suggests and what the Stata command implements.
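For the vcovBK() route, a minimal sketch on the question's data might look like this (the formula and index variables are taken from the question; fitting a pooled plm model and reporting with lmtest::coeftest are my assumptions, and this does not yet handle the AR(1) part):
library(plm)
library(lmtest)
res.plm <- plm(total_FDI ~ ciri + human_cap + worker_rts + polity_4 + lag_total +
                 market_size + income + econ_growth + log_trade + fixed_xr +
                 fix_dollar + xr_fluct + english + lab_growth,
               data = D, index = c("c_code", "year"), model = "pooling")
coeftest(res.plm, vcov = vcovBK(res.plm, cluster = "time"))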
The pcse package works well, but there are some issues that make a lot of intuitive user inputs unacceptable, especially if your dataset is unbalanced. You might want to try this rewrite of the function that I coded a while ago. Just load the pcse package, load the pcse2 function below, and use it by following the instructions in the pcse documentation. IMHO, the function pasted below is cleaner, more flexible and more robust than the one provided by the pcse folks. Simple benchmarks also suggest that my version may be 5 to 10 times faster than theirs, which may matter for big datasets.
Good luck!
library(Matrix)
pcse2 <- function(object, groupN, groupT, pairwise=TRUE){
  ## Extract basic model info
  groupT <- tail(as.character((match.call()$groupT)), 1)
  groupN <- tail(as.character((match.call()$groupN)), 1)
  dat <- eval(parse(text=object$call$data))
  ## Sanity checks
  if(!"lm" %in% class(object)){stop("Formula object must be of class 'lm'.")}
  if(!groupT %in% colnames(dat)){stop(paste(groupT, 'was not found in data', object$call$data))}
  if(!groupN %in% colnames(dat)){stop(paste(groupN, 'was not found in data', object$call$data))}
  if(anyDuplicated(paste(dat[,groupN], dat[,groupT]))>0){stop(paste('There are duplicate groupN-groupT observations in', object$call$data))}
  if(length(dat[is.na(dat[,groupT]),groupT])>0){stop('There are missing unit indices in the data.')}
  if(length(dat[is.na(dat[,groupN]),groupN])>0){stop('There are missing time indices in the data.')}
  ## Expand model frame to include groupT, groupN, resid columns.
  f <- as.formula(object$call$formula)
  f.expanded <- update.formula(f, paste(". ~ .", groupN, groupT, sep=" + "))
  dat.pcse <- model.frame(f.expanded, dat)
  dat.pcse$e <- resid(object)
  ## Extract basic model info (part II)
  N <- length(unique(dat.pcse[,groupN]))
  T <- length(unique(dat.pcse[,groupT]))
  nobs <- nrow(dat.pcse)
  is.balanced <- length(resid(object)) == N * T
  ## If balanced dataset, calculate as in Beck & Katz (1995)
  if(is.balanced){
    dat.pcse <- dat.pcse[order(dat.pcse[,groupN], dat.pcse[,groupT]),]
    X <- model.matrix(f, dat.pcse)
    E <- t(matrix(dat.pcse$e, N, T, byrow=TRUE))
    Omega <- kronecker((crossprod(E) / T), Matrix(diag(1, T)))
  ## If unbalanced and pairwise, calculate as in Franzese (1996)
  }else if(pairwise==TRUE){
    ## Rectangularize
    rectangle <- expand.grid(unique(dat.pcse[,groupN]), unique(dat.pcse[,groupT]))
    names(rectangle) <- c(groupN, groupT)
    rectangle <- merge(rectangle, dat.pcse, all.x=TRUE)
    rectangle <- rectangle[order(rectangle[,groupN], rectangle[,groupT]),]
    valid <- ifelse(is.na(rectangle$e),0,1)
    rectangle[is.na(rectangle)] <- 0
    X <- model.matrix(f, rectangle)
    X[valid==0,1] <- 0
    ## Calculate pcse
    E <- crossprod(t(matrix(rectangle$e, N, T, byrow=TRUE)))
    V <- crossprod(t(matrix(valid, N, T, byrow=TRUE)))
    if (length(V[V==0]) > 0){stop("Error! A CS-unit exists without any obs or without any obs in a common period with another CS-unit. You must remove that unit from the data passed to pcse().")}
    Omega <- kronecker(E/V, Matrix(diag(1, T)))
  ## If unbalanced and casewise, calculate based on largest rectangular subset of data
  }else{
    ## Rectangularize
    rectangle <- expand.grid(unique(dat.pcse[,groupN]), unique(dat.pcse[,groupT]))
    names(rectangle) <- c(groupN, groupT)
    rectangle <- merge(rectangle, dat.pcse, all.x=TRUE)
    rectangle <- rectangle[order(rectangle[,groupN], rectangle[,groupT]),]
    valid <- ifelse(is.na(rectangle$e),0,1)
    rectangle[is.na(rectangle)] <- 0
    X <- model.matrix(f, rectangle)
    X[valid==0,1] <- 0
    ## Keep only years for which we have the max number of observations
    large.panels <- by(dat.pcse, dat.pcse[,groupT], nrow) # How many valid observations per year?
    if(max(large.panels) < N){warning('There is no time period during which all units are observed. Consider using pairwise estimation.')}
    T.balanced <- names(large.panels[large.panels==max(large.panels)]) # Which years have max(valid observations)?
    T.casewise <- length(T.balanced)
    dat.balanced <- dat.pcse[dat.pcse[,groupT] %in% T.balanced,] # Extract biggest rectangular subset
    dat.balanced <- dat.balanced[order(dat.balanced[,groupN], dat.balanced[,groupT]),]
    e <- dat.balanced$e
    ## Calculate pcse as in Beck & Katz (1995)
    E <- t(matrix(dat.balanced$e, N, T.casewise, byrow=TRUE))
    Omega <- kronecker((crossprod(E) / T.casewise), Matrix(diag(1, T)))
  }
  ## Finish evaluation, clean and output
  salami <- t(X) %*% Omega %*% X
  bread <- solve(crossprod(X))
  sandwich <- bread %*% salami %*% bread
  colnames(sandwich) <- names(coef(object))
  row.names(sandwich) <- names(coef(object))
  pcse <- sqrt(diag(sandwich))
  b <- coef(object)
  tstats <- b/pcse
  df <- nobs - ncol(X)
  pval <- 2*pt(abs(tstats), df, lower.tail=FALSE)
  res <- list(vcov=sandwich, pcse=pcse, b=b, tstats=tstats, df=df, pval=pval, pairwise=pairwise,
              nobs=nobs, nmiss=(N*T)-nobs, call=match.call())
  class(res) <- "pcse"
  return(res)
}
Look at the pcse package, which computes panel-corrected standard errors. You certainly have to look at the Stata documentation to figure out the assumptions made and cross-check that with pcse.
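A minimal sketch of that route, using the question's variables (the pcse() arguments follow the package documentation as I recall it, so double-check against ?pcse):
library(pcse)
res.lm <- lm(total_FDI ~ ciri + human_cap + worker_rts + polity_4 + lag_total +
               market_size + income + econ_growth + log_trade + fixed_xr +
               fix_dollar + xr_fluct + english + lab_growth, data = D)
res.pcse <- pcse(res.lm, groupN = D$c_code, groupT = D$year, pairwise = TRUE)
summary(res.pcse)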
