Looping regressions and running column sums based on results in R

I have a data frame with panel data that looks as follows:
countrycode year 7111 7112 7119 7126 7129 7131 7132 7133 7138
1 AGO 1981 380491 149890 238832 0 166690 449982 710642 430481 890546
2 AGO 1982 339626 66434 183487 0 79682 108356 486799 186884 220545
3 AGO 1983 128043 2697 91404 148617 3988 432725 829958 138764 152822
4 AGO 1984 67832 0 85613 1251 45644 361733 1250272 237236 2952746
5 AGO 1985 354335 11225 143000 2130 7687 2204297 942071 408907 474666
There are 159 four-digit column variables like the ones shown above. There are also column variables named CEPI1_fw and CIPI1_fw. Furthermore, there are 46 countries and 34 years in the data set.
I would like to use the plm command to regress each of the numerical column variables on CEPI1_fw and CIPI1_fw. Then, I would like to sum the numerical column variables in the data frame above based on whether the coefficients from the regressions are above or below a certain threshold. The resulting output should be a pair of columns added to the data frame above.

There are a few ambiguities in your question, but I'll take a shot.
First, I'm going to revamp your code slightly: adding rows to data frames is very inefficient (probably doesn't matter in this application, but it's a bad habit to get into ...)
library(plm)

## regress each 4-digit outcome column on CEPI1_fw and CIPI1_fw,
## skipping the panel index columns and the regressors themselves
yvars <- setdiff(colnames(master5), c("countrycode", "year", "CEPI1_fw", "CIPI1_fw"))
out <- list()
for (i in yvars) {
  ## backticks let column names that start with a digit (e.g. 7111) parse
  f <- as.formula(paste0("`", i, "` ~ CEPI1_fw + CIPI1_fw"))
  ## plm() takes the first two columns (countrycode, year) as the panel index by default
  m <- summary(plm(f, data = master5, model = "within"))
  out <- c(out, list(data.frame(yvar = i,
                                coef = m$coefficients[1, 1],   # CEPI1_fw coefficient
                                pval = m$coefficients[1, 4],   # its p-value
                                stringsAsFactors = FALSE)))
}
out <- do.call(rbind, out) ## combine elements into a single data frame
Select only statistically significant response variables. From a statistical/inferential point of view, this is probably a bad idea ...
out <- out[out$pval<0.05,]
Select the names of variables where the coefficients are above a threshold
big_vars <- out$yvar[abs(out$coef)>threshold]
Compute column sums from another data set ...
colSums(other_data[big_vars])
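If instead the goal is the pair of columns described in the question (per-row sums over the above-threshold variables and over the remaining significant variables, appended to the original panel data), a minimal sketch building on the objects above might look like this; the column names sum_above and sum_below are my own:
## variables that were significant but fell below the threshold
small_vars <- setdiff(out$yvar, big_vars)
## append the pair of row-sum columns to the original panel data frame
master5$sum_above <- rowSums(master5[big_vars])
master5$sum_below <- rowSums(master5[small_vars])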

Related

Conditional reordering of values within column (permutation) in R

I am working on a research paper on graph manipulation and I have the following data:
returns 1+returns cum_return price period_ret(step=25)
1 7.804919e-03 1.0078049 0.007804919 100.78355 NA
2 3.560800e-03 1.0035608 0.011393511 101.14306 NA
3 -1.490719e-03 0.9985093 0.009885807 100.99239 NA
. -2.943304e-03 0.9970567 0.006913406 100.69558 NA
. 1.153007e-03 1.0011530 0.008074385 100.81175 NA
. -2.823012e-03 0.9971770 0.005228578 100.52756 NA
25 -7.110762e-03 0.9928892 -0.001919363 99.81526 -0.02364
. -1.807268e-02 0.9819273 -0.019957356 98.02754 NA
. -3.300315e-03 0.9966997 -0.023191805 97.70455 NA
250 5.846750e-03 1.0058467 -0.017480652 98.27748 0.12125
These are 250 daily stock returns, the cumulative return, the price and the 25-day period returns (returns between days 0-25; 25-50; ...; 200-250).
What I want to do is the following:
I want to rearrange the returns, but the period returns should stay identical, although their order can change. Since there are 10 subsets of 25 days, there are 10! possible orderings of the subsets.
What I did so far: I wrote code using the sample, repeat and identical functions; here is a shortened version:
repeat {
  temp <- tibble(
    returns = sample(x$returns, 250, replace = TRUE)
  )
  if (identical(sort(round(x$period_ret[!is.na(x$period_ret)], 2)),
                sort(round(temp$period_ret[!is.na(temp$period_ret)], 2)))) break
}
This took me quite some time and unfortunately it isn't of any real use. Only later did I think about the math: there are 250! possible samples, so I would be waiting days for any result.
What do I need this for?
I would like to create graphs with different orderings of the returns. Thus, all the graphs have the same summary statistics but look different. It's important that they have the same period returns (regardless of their order) to fulfil a utility formula.
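For what it's worth, one possible approach (my own sketch, not from the original thread) is to permute the returns only within each 25-day block. Reordering inside a block leaves the product of (1 + return) for that block unchanged, so every 25-day period return is preserved exactly while the daily path, and therefore the graph, changes. The block order itself could additionally be shuffled.
library(tibble)

set.seed(1)                                      # reproducible example
block <- rep(seq_len(10), each = 25)             # 10 blocks of 25 trading days
temp <- tibble(
  returns = ave(x$returns, block, FUN = sample)  # shuffle within each block only
)
temp$cum_return <- cumprod(1 + temp$returns) - 1 # rebuild the cumulative series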

How do I generate a dataframe displaying the number of unique pairs between two vectors, for each unique value in one of the vectors?

First of all, I apologize for the title. I really don't know how to succinctly explain this issue in one sentence.
I have a dataframe where each row represents some aspect of a hospital visit by a patient. A single patient might have thousands of rows for dozens of hospital visits, and each hospital visit could account for several rows.
One column is Medical.Record.Number, which corresponds to Patient IDs, and the other is Patient.ID.Visit, which corresponds to an ID for an individual hospital visit. I am trying to calculate the number of hospital visits each patient has had.
For example:
Medical.Record.Number    Patient.ID.Visit
AAAXXX           1111
AAAXXX           1112
AAAXXX           1113
AAAZZZ           1114
AAAZZZ           1114
AAABBB           1115
AAABBB           1116
would produce the following:
Medical.Record.Number   Number.Of.Visits
AAAXXX          3
AAAZZZ          1
AAABBB          2
The solution I am currently using is the following, where "data" is my dataframe:
#this function returns the number of unique hospital visits associated with the
#supplied record number
countVisits <- function(record.number){
  visits.by.number <- data$Patient.ID.Visit[which(data$Medical.Record.Number == record.number)]
  return(length(unique(visits.by.number)))
}
recordNumbers <- unique(data$Medical.Record.Number)
visits <- integer()
for (record in recordNumbers){
  visits <- c(visits, countVisits(record))
}
visit.counts <- data.frame(recordNumbers, visits)
This works, but it is pretty slow. I am dealing with potentially millions of rows of data, so I'd like something efficient. From what little I know about R, I know there's usually a faster way to do things without using a for-loop.
This essentially looks like a table() operation after you take out duplicates. First, some sample data
#sample data
dd<-read.table(text="Medical.Record.Number Patient.ID.Visit
AAAXXX 1111
AAAXXX 1112
AAAXXX 1113
AAAZZZ 1114
AAAZZZ 1114
AAABBB 1115
AAABBB 1116", header=T)
then you could do
tt <- table(Medical.Record.Number=unique(dd)$Medical.Record.Number)
as.data.frame(tt, responseName="Number.Of.Visits") #to get a data.frame rather than named vector (table)
# Medical.Record.Number Number.Of.Visits
# 1 AAABBB 2
# 2 AAAXXX 3
# 3 AAAZZZ 1
Or you could also think of this as an aggregation problem
aggregate(Patient.ID.Visit~Medical.Record.Number, dd, function(x) length(unique(x)))
# Medical.Record.Number Patient.ID.Visit
# 1 AAABBB 2
# 2 AAAXXX 3
# 3 AAAZZZ 1
There are many ways to do this; @MrFlick provided a handful of perfectly valid approaches. Personally, I'm fond of the data.table package. It's faster on large data frames and I find its logic more intuitive than the base functions. I'd check it out if you are having problems with execution time.
library(data.table)
med.dt <- data.table(med_tbl)
num.visits.dt <- med.dt[, .(num_visits = length(unique(Patient.ID.Visit))),
                        by = Medical.Record.Number]
data.table should be much faster than data.frame on large tables.
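As a quick usage check against the sample data dd built above (substituting it for med_tbl), this gives the same three counts:
med.dt <- as.data.table(dd)
med.dt[, .(num_visits = length(unique(Patient.ID.Visit))), by = Medical.Record.Number]
#    Medical.Record.Number num_visits
# 1:                AAAXXX          3
# 2:                AAAZZZ          1
# 3:                AAABBB          2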

R predict function not using the entire test data set, only using partial data for prediction

I have a training data set with 700 records. I prepared the model using the C5.0 function with this data.
library(C50)
abc_model <- C5.0(abc_train[-5], abc_train$resultval)
I have test data with 5000 records.
I am using the predict function to do the prediction on these 5000 records.
abc_Test <- read.csv("FullData.csv", quote="")
abc_pred <- predict(abc_model, abc_test)
This is giving me predictions for ONLY 700 records, not all 5000.
How can I make it predict all 5000?
When the training data is larger than the test data, the result is fine: I get predictions for every record and can combine the test data with the results and write the output to a .CSV. But when the training data is smaller than the test data, not all records get predicted.
x <- data.frame(abc_test, abc_pred)
Any input on how to overcome this problem? I am not an expert in R, so any suggestions would help me a lot.
Thanks, Richard.
Below is my train data, few recs.
Id Value1 Value2 Country Result
20835 63 1 United States yes
3911156 60 12 Romania no
39321 10 3 United States no
29425 80 9 Australia no
Below is my test data, few recs again.
Id Value1 Value2 Country
3942587 114 12 United States
3968314 25 13 Sweden
3973205 83 10 Russian Federation
17318 159 9 Russian Federation
I am trying to find the Result value and append it to my test data. But, like I described, I am getting the Result only for 700 records, not all 5000.
You should try this:
str(abc_train)
str(abc_test)
lapply(abc_train[names(abc_train) != "Result"], table)
lapply(abc_test, table)
Then you will probably find that some of the levels of some variables in abc_test were not present in abc_train, so predictions could not be produced for those cases. I'm guessing you expected the numeric values to be handled as though a regression had been fitted, but that won't happen in any prediction function if those columns are factors, and perhaps not at all depending on the function's behavior. Looking at C50::C5.0.default, it appears there may be no regression option for the predictors.
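If that turns out to be the cause, a hedged sketch of a workaround is to re-declare the offending test factor(s) with the training levels before predicting; unseen levels then become NA, which C5.0 treats as missing values:
# assuming Country is the column with levels unseen in training
abc_test$Country <- factor(as.character(abc_test$Country),
                           levels = levels(abc_train$Country))
abc_pred <- predict(abc_model, abc_test)
length(abc_pred)   # should now equal nrow(abc_test)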

Eliminating Existing Observations in a Zoo Merge

I'm trying to do a zoo merge between stock prices from selected trading days and observations about those same stocks (we call these "Nx observations") made on the same days. Sometimes we do not have Nx observations on stock trading days, and sometimes we have Nx observations on non-trading days. We want to place an NA where we do not have an Nx observation on a trading day, but eliminate Nx observations that fall on non-trading days, since without trading data for the same day the Nx observations are useless.
The following SO question is close to mine, but I would characterize that question as REPLACING missing data, whereas my objective is to truly eliminate observations made on non-trading days (if necessary, we can change the process by which Nx observations are taken, but it would be a much less expensive solution to leave it alone).
merge data frames to eliminate missing observations
The script I have prepared to illustrate follows (I'm new to R and SO; all suggestions welcome):
# create Stk_data data.frame for use in the Stack Overflow question
Date_Stk <- c("1/2/13", "1/3/13", "1/4/13", "1/7/13", "1/8/13") # dates for stock prices used in the example
ABC_Stk <- c(65.73, 66.85, 66.92, 66.60, 66.07) # stock prices for tkr ABC for Jan 1 2013 through Jan 8 2013
DEF_Stk <- c(42.98, 42.92, 43.47, 43.16, 43.71) # stock prices for tkr DEF for Jan 1 2013 through Jan 8 2013
GHI_Stk <- c(32.18, 31.73, 32.43, 32.13, 32.18) # stock prices for tkr GHI for Jan 1 2013 through Jan 8 2013
Stk_data <- data.frame(Date_Stk, ABC_Stk, DEF_Stk, GHI_Stk) # create the stock price data.frame
# create Nx_data data.frame for use in the Stack Overflow question
Date_Nx <- c("1/2/13", "1/4/13", "1/5/13", "1/6/13", "1/7/13", "1/8/13") # dates for Nx Observations used in the example
ABC_Nx <- c(51.42857, 51.67565, 57.61905, 57.78349, 58.57143, 58.99564) # Nx scores for stock ABC for Jan 1 2013 through Jan 8 2013
DEF_Nx <- c(35.23809, 36.66667, 28.57142, 28.51778, 27.23150, 26.94331) # Nx scores for stock DEF for Jan 1 2013 through Jan 8 2013
GHI_Nx <- c(7.14256, 8.44573, 6.25344, 6.00423, 5.99239, 6.10034) # Nx scores for stock GHI for Jan 1 2013 through Jan 8 2013
Nx_data <- data.frame(Date_Nx, ABC_Nx, DEF_Nx, GHI_Nx) # create the Nx scores data.frame
# create zoo objects & merge
z.Stk_data <- zoo(Stk_data, as.Date(as.character(Stk_data[, 1]), format = "%m/%d/%Y"))
z.Nx_data <- zoo(Nx_data, as.Date(as.character(Nx_data[, 1]), format = "%m/%d/%Y"))
z.data.outer <- merge(z.Stk_data, z.Nx_data)
The NAs on Jan 3 2013 for the Nx observations are fine (we'll use the na.locf) but we need to eliminate the Nx observations that appear on Jan 5 and 6 as well as the associated NAs in the Stock price section of the zoo objects.
I've read the R documentation for merge.zoo regarding the use of "all": that its use "allows intersection, union and left and right joins to be expressed". But trying all combinations of the following use of "all" yielded the same results (as to why would be a secondary question).
z.data.outer <- zoo(merge(x = Stk_data, y = Nx_data, all.x = FALSE)) # try using "all"
While I would appreciate comments on the secondary question, I'm primarily interested in learning how to eliminate the extraneous Nx observations on days when there is no trading of stocks. Thanks. (And thanks in general to the community for all the great explanations of R!)
The all argument of merge.zoo must be (quoting from the help file):
logical vector having the same length as the number of "zoo" objects to be merged (otherwise expanded)
and you want to keep all rows from the first argument but not the second, so its value should be c(TRUE, FALSE).
merge(z.Stk_data, z.Nx_data, all = c(TRUE, FALSE))
The reason for the change in all syntax for merge.zoo relative to merge.data.frame is that merge.zoo can merge any number of arguments, whereas merge.data.frame only handles two, so the syntax had to be extended.
Also note that %Y should have been %y in the question's code.
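Putting both corrections together, a rough sketch of the full sequence using the question's objects; here the date column is also dropped when building each zoo object so the merged series stays numeric, a detail the thread does not discuss:
library(zoo)

z.Stk_data <- zoo(Stk_data[, -1],
                  as.Date(as.character(Stk_data$Date_Stk), format = "%m/%d/%y"))
z.Nx_data  <- zoo(Nx_data[, -1],
                  as.Date(as.character(Nx_data$Date_Nx), format = "%m/%d/%y"))

## keep every trading day (NA where Nx is missing), drop Nx-only days
merge(z.Stk_data, z.Nx_data, all = c(TRUE, FALSE))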
I hope I have understood your desired output correctly ("NAs on Jan 3 2013 for the Nx observations are fine"; "eliminate [...] observations that appear on Jan 5 and 6"). I don't quite see the need for zoo in the merging step.
merge(Stk_data, Nx_data, by.x = "Date_Stk", by.y = "Date_Nx", all.x = TRUE)
# Date_Stk ABC_Stk DEF_Stk GHI_Stk ABC_Nx DEF_Nx GHI_Nx
# 1 1/2/13 65.73 42.98 32.18 51.42857 35.23809 7.14256
# 2 1/3/13 66.85 42.92 31.73 NA NA NA
# 3 1/4/13 66.92 43.47 32.43 51.67565 36.66667 8.44573
# 4 1/7/13 66.60 43.16 32.13 58.57143 27.23150 5.99239
# 5 1/8/13 66.07 43.71 32.18 58.99564 26.94331 6.10034

R: Percentile calculations on subsets of data

I have a data set which contains the following identifiers, an rscore, gvkey, sic2, year, and cdom. What I am looking to do is calculate percentile ranks based on summed rscores for all temporal spans (~1500) for a given gvkey, and then calculate percentile ranks in a given temporal time span and sic2 based on gvkey.
Calculating the percentiles for all temporal time spans is a fairly quick process; however, once I add in calculating the sic2 percentile ranks it becomes fairly slow, and we are likely looking at ~65,000 subsets in total. I'm wondering if there is a possibility of speeding up this process.
The data for one temporal time span looks like the following
gvkey sic2 cdom rscoreSum pct
1187 10 USA 8.00E-02 0.942268617
1265 10 USA -1.98E-01 0.142334654
1266 10 USA 4.97E-02 0.88565478
1464 10 USA -1.56E-02 0.445748247
1484 10 USA 1.40E-01 0.979807985
1856 10 USA -2.23E-02 0.398252565
1867 10 USA 4.69E-02 0.8791019
2047 10 USA -5.00E-02 0.286701209
2099 10 USA -1.78E-02 0.430915371
2127 10 USA -4.24E-02 0.309255308
2187 10 USA 5.07E-02 0.893020421
The code to calculate the industry ranks is below, and fairly straightforward.
library(plyr)

#generate 2-digit industry SIC percentile ranks
dout <- ddply(dfSum, .(sic2), function(x){
  indPct <- rank(x$rscoreSum)/nrow(x)
  gvkey <- x$gvkey
  x <- data.frame(gvkey, indPct)
})
#merge 2 digit industry SIC percentile ranks with market percentile ranks
dfSum <- merge(dfSum, dout, by = "gvkey")
names(dfSum)[2] <- 'sic2'
Any suggestions to speed the process would be appreciated!
You might try the data.table package for fast operations across relatively large datasets like yours. For example, my machine has no problem working through this:
library(data.table)
# Create a dataset like yours, but bigger
n.rows <- 2e6
n.sic2 <- 1e4
dfSum <- data.frame(gvkey=seq_len(n.rows),
                    sic2=sample.int(n.sic2, n.rows, replace=TRUE),
                    cdom="USA",
                    rscoreSum=rnorm(n.rows))
# Now make your dataset into a data.table
dfSum <- data.table(dfSum)
# Calculate the percentiles
# Note that there is no need to re-assign the result
dfSum[, indPct:=rank(rscoreSum)/length(rscoreSum), by="sic2"]
whereas the plyr equivalent takes a while.
If you like the plyr syntax (I do), you may also be interested in the dplyr package, which is billed as "the next generation of plyr", with support for faster data stores in the backend.
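For reference, a rough sketch of the same grouped percentile rank in dplyr syntax, assuming the dfSum object above:
library(dplyr)

dfSum <- dfSum %>%
  group_by(sic2) %>%
  mutate(indPct = rank(rscoreSum) / n()) %>%
  ungroup()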
