PLM in R: variable standardization and lag

My goal is to understand whether clicks (say, on a website) or calls (say, over the phone to new contacts) have a greater impact on new sign-ups over a 90-week period in the USA. I thought this problem lent itself well to PLM.
My data consist of 210 DMA regions (DMA), 90 weeks (WEEK), new sign-ups (NEW, the outcome) and two predictors (CALLS and CLICKS), so for example:
plm_model <- plm(NEW ~ CALLS + CLICKS, data=df, index=c("DMA", "WEEK"), model="within")
First, does one standardize the predictors and outcome across all panels (i.e., over the whole sample) or within each panel (i.e., separately for each DMA)? The models differ depending on which I do, as they should, but I can't find any documentation on which is "more correct" or why one would prefer one over the other.
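For concreteness, the two versions might be computed something like this (a sketch; the _z_all and _z_dma column names are just placeholders):
# pooled: one mean/sd across all DMAs and weeks
df$CLICKS_z_all <- as.numeric(scale(df$CLICKS))
df$CALLS_z_all <- as.numeric(scale(df$CALLS))
# within-panel: a separate mean/sd for each DMA
df$CLICKS_z_dma <- ave(df$CLICKS, df$DMA, FUN = function(v) as.numeric(scale(v)))
df$CALLS_z_dma <- ave(df$CALLS, df$DMA, FUN = function(v) as.numeric(scale(v)))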
Second, when I look at the data across all DMA regions by time, there is a lag of 4 periods on CLICKS x NEW but 20 periods on CALLS x NEW. I arrived at those using:
which.max(sapply(1:50, function(i) cor(df$NEW, lag(df$CLICKS, i), use = "complete")))
which.max(sapply(1:50, function(i) cor(df$NEW, lag(df$CALLS, i), use = "complete")))
This does make sense for the data: there is more NEW soon after CLICKS, but CALLS take more time to pay off.
I thought the logical next step would be to lag each predictor according to its correlation peak and then standardize based on the data available after lagging (i.e., fewer observations). I was again stuck on whether to standardize within or across panels (as mentioned above), and then I was going to do something like:
plm_model <- plm(NEW ~ CALLS_20_standardized + CLICKS_4_standardized, data=df, index=c("DMA", "WEEK"), model="within")
However, I am not sure whether that is the correct step to finish this exercise off. Any insight here would be appreciated.
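For reference, a minimal sketch of that final step could look like the following (the lagged and standardized column names are placeholders; lags are taken within each DMA so values do not cross regions, and the standardization here is pooled, so the per-DMA version above could be swapped in instead):
library(dplyr)
library(plm)
df_lagged <- df %>%
  group_by(DMA) %>%
  arrange(WEEK, .by_group = TRUE) %>%
  mutate(CLICKS_lag4 = dplyr::lag(CLICKS, 4),      # lag within each DMA
         CALLS_lag20 = dplyr::lag(CALLS, 20)) %>%
  ungroup() %>%
  mutate(CLICKS_4_standardized = as.numeric(scale(CLICKS_lag4)),
         CALLS_20_standardized = as.numeric(scale(CALLS_lag20))) %>%
  as.data.frame()
# rows lost to lagging contain NA and are dropped by plm automatically
plm_model <- plm(NEW ~ CALLS_20_standardized + CLICKS_4_standardized,
                 data = df_lagged, index = c("DMA", "WEEK"), model = "within")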

Related

Making a for loop in R

I am just getting started with R, so I am sorry if I say things that don't make sense.
I am trying to make a for loop that does the following:
l_dtest[[1]]<-vector()
l_dtest[[2]]<-vector()
l_dtest[[3]]<-vector()
l_dtest[[4]]<-vector()
l_dtest[[5]]<-vector()
all the way up to whatever number is assigned as n. For example, if n were chosen to be 100, it would repeat this all the way to l_dtest[[100]]<-vector().
I have tried multiple different attempts at doing this and here is one of them.
n<-4
p<-(1:n)
l_dtest<-list()
for(i in p){
print((l_dtest[i]<-vector())<-i)
}
Again, I am VERY new to R, so I don't know what I am doing or what is wrong with this loop.
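For reference, a minimal sketch of the pre-allocation described above (seq_len(n) keeps the loop safe even if n is 0):
n <- 100
l_dtest <- vector("list", n)   # a list with n empty slots
for (i in seq_len(n)) {
  l_dtest[[i]] <- vector()     # put an empty vector in slot i
}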
The detailed background for why I need to do this is that I need to write an R function that receives as input the size of the population n, runs a simulation of the model below with that population size, and returns the number of generations it took to reach an MRCA (most recent common ancestor).
Here is the model,
We assume the population size is constant at n. Generations are discrete and non-overlapping. The genealogy is formed by this random process: in each generation, each individual chooses two parents at random from the previous generation. The choices are made randomly and equally likely over the n possibilities, and each individual chooses twice. All choices are made independently. Thus, for example, it is possible that, when an individual chooses his two parents, he chooses the same individual twice, so that in fact he ends up with just one parent; this happens with probability 1/n.
I don't understand the specific step at the beginning of this post or why I need to do it, but my teacher said I do. I don't know if this helps, but the next step is choosing parents for the first person and then combining the lists from the step I posted with a previous step. It looks like this:
sample(1:5, 2, replace=T)
#[1] 1 2
l_dtemp[[1]]<-union(l_dtemp[[1]], l_d[[1]]) # To my understanding, l_dtemp[[1]] now receives the list of descendants from l_d[[1]] because the latter chose l_dtemp[[1]] as its first parent
l_dtemp[[2]]<-union(l_dtemp[[2]], l_d[[1]]) # Same as above, but for l_d[[1]]'s 2nd choice, which is l_dtemp[[2]]
sample(1:5, 2, replace=T)
#[1] 1 3
l_dtemp[[1]]<-union(l_dtemp[[1]], l_d[[2]])
l_dtemp[[3]]<-union(l_dtemp[[3]], l_d[[2]])
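Putting the pieces together, one generation of the simulation could be sketched like this (l_d and l_dtemp follow the names used above; starting l_d with each individual as its own descendant set is an assumption):
n <- 5
l_d <- as.list(seq_len(n))                            # assumption: each individual starts as its own descendant set
l_dtemp <- replicate(n, vector(), simplify = FALSE)   # empty descendant sets for the parent generation
for (j in seq_len(n)) {
  parents <- sample(seq_len(n), 2, replace = TRUE)    # individual j picks two parents (possibly the same one)
  l_dtemp[[parents[1]]] <- union(l_dtemp[[parents[1]]], l_d[[j]])
  l_dtemp[[parents[2]]] <- union(l_dtemp[[parents[2]]], l_d[[j]])
}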

From Stata to R: recoding bysort and xtreg

I'm very new to R and currently working on a replication project for a meta-research course at my university. The paper examines whether having an in-home display to monitor energy consumption reduces energy usage. I have already recoded 300 lines of code, but now I have run into a problem I could not yet solve.
The source code says: bysort id expdays: egen ave15 = mean(power) if hours0105==1
I do understand what this does, but I cannot replicate it in R. id is the identifier for the examined household and expdays denotes the current day of the experiment, so ave15 is the average power consumption from midnight to 6 am for each household on each day. I figured out that (EIPbasedata is the complete dataset containing hourly data)
EIPbasedata$ave15[EIPbasedata$hours0105 == 1] <- ave(EIPbasedata$power, EIPbasedata$ID, EIPbasedata$ExpDays, FUN=mean)
would probably do the job, but this gives me a warning:
number of items to replace is not a multiple of replacement length
and the results are not right either. I have no idea what I could do to solve this.
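One fix that might reproduce the Stata line is to subset both sides of the assignment, so the replacement has the same length as the target (a sketch using the column names from the post):
idx <- which(EIPbasedata$hours0105 == 1)
EIPbasedata$ave15 <- NA
# mean power by household and experiment day, computed only over the 0-6 am rows
EIPbasedata$ave15[idx] <- ave(EIPbasedata$power[idx],
                              EIPbasedata$ID[idx],
                              EIPbasedata$ExpDays[idx],
                              FUN = mean)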
The next thing I struggle to recode is:
xtreg ln_power0105 ihd0105 i.days0105 if exptime==4, fe vce(bootstrap, rep(200) seed(12345))
I think the right way would be to use plm, but I'm not sure how to implement the if condition (days0105 is a running variable for the number of the day in the experiment, and 0 if not between 0 and 6 am; ihd0105 is a dummy for having an in-home display; exptime denotes 4 am in the morning, though I do not understand what exptime does here).
table4_1 <- plm(EIPbasedata$ln_power0105 ~ EIPbasedata$ihd0105, data=EIPbasedata, index = c("days0105"), model="within")
How do I compute the bootstrapped standard errors in plm?
I hope some expert can help me, since my R and Stata knowledge is not sufficient for this.
My lecturer provided the answer: first, I specify a subsample, which I call tmp_data here: tmp_data <- EIPbasedata[which(EIPbasedata$ExpTime == 4), ]
Then I regress on the tmp_data with as.factor(days0105), which is the R equivalent of Stata's i.days0105:
tmp_results <- plm(tmp_data$ln_power0105 ~ tmp_data$ihd0105 + as.factor(tmp_data$days0105), data = tmp_data, index = ("ID"), model = "within")
There are probably better and cleaner ways to do this, but I'm fine with it for now.
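A slightly cleaner variant would reference the columns through data=, and the bootstrapped standard errors could be approximated by resampling households with the boot package (a sketch, not an exact translation of Stata's vce(bootstrap, rep(200) seed(12345))):
library(plm)
library(boot)
tmp_results <- plm(ln_power0105 ~ ihd0105 + factor(days0105),
                   data = tmp_data, index = "ID", model = "within")
# cluster bootstrap over households: resample IDs, refit, keep the ihd0105 coefficient
boot_fun <- function(ids, i) {
  d <- do.call(rbind, lapply(seq_along(i), function(k) {
    block <- tmp_data[tmp_data$ID == ids[i[k]], ]
    block$ID <- k   # relabel so plm's index stays unique when a household is drawn twice
    block
  }))
  coef(plm(ln_power0105 ~ ihd0105 + factor(days0105),
           data = d, index = "ID", model = "within"))["ihd0105"]
}
set.seed(12345)
b <- boot(unique(tmp_data$ID), boot_fun, R = 200)
sd(b$t[, 1])   # bootstrapped standard error of the ihd0105 coefficient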

Optimizing dataset based on several conditions

I am trying to construct an (optimal) subset from a large dataset based on several conditions. I know that there are some possibilities to construct such a subset; see for example this link. I tried that function, but it is unsatisfactory since it takes too long to find such a subset and might not be "intelligent" enough. Below you can find some sample data.
data <- data.table(id=rep(c("a","b","c","d","e","f"),3),
balance=c(1000,2000,1500,2000,4000,1500,
800,2000,1300,1800,2000,500,
700,1900,1100,1600,500,30),
rate=c(1100,1500,1000,700,300,200,
400,700,500,1300,1600,700,
800,1100,1200,700,400,150),
grade=c(70,100,90,50,150,40,
30,80,55,80,85,20,
35,70,55,75,15,10),
date= rep(c(2012,2013,2014),each=6))
data_agg <- aggregate(cbind(rate, grade) ~ date, data = data.frame(data),sum,na.rm=T)
data_agg$ratio <- data_agg$rate / data_agg$grade
> data_agg$ratio
[1] 9.60000 14.85714 16.73077
Now the objective is (e.g.) to minimize the increase in data_agg$ratio over the years and at the same time include at least 3 IDs in this subset.
By looking at the data we see, e.g., that ID == "e" has a ratio of 300/150 = 2 in 2012, 1600/85 = 19 in 2013 and 400/15 = 27 in 2014. The objective is to minimize the increase over the years, so deleting "e" might have a desirable effect on the subset.
datasubset <-subset(data, subset = id!=c("e"))
data_aggsubset <- aggregate(cbind(rate, grade) ~ date, data = data.frame(datasubset),sum,na.rm=T)
data_aggsubset$ratio <- data_aggsubset$rate / data_aggsubset$grade
data_aggsubset$ratio
[1] 12.85714 13.58491 16.12245
And indeed, the ratio is more stable over the years now. So my question is whether there is some optimizer function which seeks IDs such that this ratio stays, e.g., within a bandwidth of +/- 50% of the starting value (9.6 in this example) while keeping at least three IDs. My original dataset is large, so I am looking for a more intelligent function than the one in the link above. Please let me know if anything is unclear. Thank you in advance!
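For what it's worth, a brute-force sketch of such a search on this sample data might look like the following (interpreting the bandwidth as +/- 50% of each candidate subset's starting ratio; the real, much larger data would need a greedy or heuristic search instead of full enumeration):
library(data.table)
ids <- unique(data$id)
# enumerate every subset with at least three ids (fine for 6 ids)
candidates <- unlist(lapply(3:length(ids), function(k) combn(ids, k, simplify = FALSE)),
                     recursive = FALSE)
ratio_path <- function(s) data[id %in% s, .(ratio = sum(rate) / sum(grade)), by = date]$ratio
ok <- Filter(function(s) {
  r <- ratio_path(s)
  all(abs(r - r[1]) <= 0.5 * r[1])   # every year within +/- 50% of that subset's starting ratio
}, candidates)
# among the admissible subsets, take the one with the flattest ratio over the years
best <- ok[[which.min(sapply(ok, function(s) diff(range(ratio_path(s)))))]]
best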

R - efficiently organize tables on condition over time

I'd like to know how to organize a data.frame into tables based on conditions over time. I have a politics data set where, over the last few decades, certain organizations take positions on bills that either pass or fail.
I know how to organize the data into tables individually, but I do it one by one, and it's really hard to see the trends. The Stack Overflow community always seems to have ingenious ways of grouping data. Here's some mock data:
Data <- data.frame(
year = sample(1998:2004, 200, replace = TRUE),
outcome = sample(0:1, 200, replace = TRUE),
biz1 = sample(-2:2, 200, replace = TRUE),
biz2 = sample(-2:2, 200, replace = TRUE),
biz3 = sample(-2:2, 200, replace = TRUE)
)
In the biz columns, a negative number means they oppose the outcome and a positive number means they support it. In outcome, a 0 means the law did not pass and a 1 means that it did.
I would like to use tables to see how each business has become more or less successful over time, by looking at how their positive numbers match 1s and their negative numbers match 0s, compared to every other organization (and vice versa).
A few notes
In the data set, I have about 100 businesses as columns, so I definitely need an efficient way to make the tables without naming every single column. I can select them in a range, like 125:300, since they are ordered together.
Of course I'm open to all ideas! Feel free to list any other ways of looking at this.
If I failed to ask this question right, please let me know how I could improve it.
The comments above about your question being too vague are right on target. Having said that, this interests me and the vagueness leaves me free to interpret...
First, I'd recode the outcome as -1 if the bill fails. Then outcome * bizn is, in a sense, a success score for that business on that legislation: positive if either a bill that the business supported passed, or a bill that the business opposed failed. Then there are several ways to visualize the scores. Here are just a few to get you started.
# re-code outcomes
Data$outcome <- ifelse(Data$outcome==0,-1,1)
library(reshape2) # for melt(...)
library(ggplot2)
gg <- melt(Data, id=c("year","outcome"),
variable.name="business", value.name="support")
gg$score <- with(gg,outcome*support) # score represents level of success
# mean success vs. year with +/- 1 sd
ggplot(gg,aes(x=year,y=score, color=business))+
stat_summary(fun.data="mean_sdl")+
stat_summary(fun.y=mean,geom="line")+
facet_grid(business~.)
# boxplot of success scores
ggplot(gg,aes(x=factor(year),y=score))+
geom_boxplot(aes(fill=business))+
facet_grid(business~.)
# barplot of success/failure frequencies
# excludes cases where a business did not take a position pro or con
gg.bar <- aggregate(score~year+business,gg,
function(eff)c(success=sum(eff>0),failure=sum(eff<0)))
gg.bar <- data.frame(gg.bar[1:2],gg.bar$score)
ggplot(gg.bar,aes(x=factor(year)))+
geom_bar(aes(y=success,fill="success"),stat="identity")+
geom_bar(aes(y=-failure,fill="failure"),stat="identity")+
geom_hline(yintercept=0,linetype=2,color="blue")+
scale_fill_discrete(name="",breaks=c("success","failure"))+
labs(x="",y="frequency")+
facet_grid(business~.)
All of these represent rather simplistic ways of looking at the data. If this was a serious project I would probably run a principal components analysis on the businesses to identify groups of businesses that tend to support or oppose the same legislation. Then I'd run a cluster analysis on the principal components to identify groups of legislation that tend to attract the support or opposition of groups of businesses.
Another way to approach this would be to run a logistic regression on the outcomes using the support/opposition of the various businesses as predictors. This would tell you which businesses tend to be more influential.
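For instance, a logistic regression along those lines might look like this on the mock data (just a sketch; the commented reformulate() line is for the roughly 100-column case mentioned in the question):
# outcome was recoded to -1/1 above, so convert back to 0/1 for the logistic model
fit <- glm(I(outcome == 1) ~ biz1 + biz2 + biz3, data = Data, family = binomial)
summary(fit)   # positive coefficients: support from that business tends to go with passage
# with ~100 business columns, build the formula programmatically, e.g.
# reformulate(names(Data)[125:300], response = "I(outcome == 1)")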

Summarized huge data: how to handle it with R?

I am working on the EBS Forex market limit order book (LOB); here is an example of the LOB in a 100-millisecond time slice:
datetime|side(0=Bid,1=Ask)| distance(1:best price, 2: 2nd best, etc.)| price
2008/01/28,09:11:28.000,0,1,1.6066
2008/01/28,09:11:28.000,0,2,1.6065
2008/01/28,09:11:28.000,0,3,1.6064
2008/01/28,09:11:28.000,0,4,1.6063
2008/01/28,09:11:28.000,0,5,1.6062
2008/01/28,09:11:28.000,1,1,1.6067
2008/01/28,09:11:28.000,1,2,1.6068
2008/01/28,09:11:28.000,1,3,1.6069
2008/01/28,09:11:28.000,1,4,1.6070
2008/01/28,09:11:28.000,1,5,1.6071
2008/01/28,09:11:28.500,0,1,1.6065 (I skip the rest)
To summarize the data, they use two rules (I have changed them a bit for simplicity):
If there is no change in the LOB on the bid or ask side, they do not record that side. Look at the last line of the data: the millisecond field was 000 and is now 500, which means there was no change in the LOB on either side for 100, 200, 300 and 400 milliseconds (but that information is important for any calculation).
If the last price (only the last) is removed from a given side of the order book, a single record is written with nothing in the price field. Again, there will be no record for the whole LOB at that time.
Example:2008/01/28,09:11:28.800,0,1,
I want to calculate minAsk - maxBid (1.6067 - 1.6066) or a weighted average price (using the sizes at all distances as weights; there is a size column in my real data). I want to do this for my whole data set. But as you can see, the data have been summarized, and this is not routine. I have written code to reproduce the whole data (not just the summary). This is fine for a small data set, but for a large one I end up creating a huge file. I was wondering if you have any tips on how to handle the data and how to fill the gaps efficiently.
You did not give a great reproducible example so this will be pseudo/untested code. Read the docs carefully and make adjustments as needed.
I'd suggest you first filter and split your data into two data.frames:
best.bid <- subset(data, side == 0 & distance == 1)
best.ask <- subset(data, side == 1 & distance == 1)
Then, for each of these two data.frames, use findInterval to compute the corresponding best ask or best bid:
best.bid$ask <- best.ask$price[findInterval(best.bid$time, best.ask$time)]
best.ask$bid <- best.bid$price[findInterval(best.ask$time, best.bid$time)]
(for this to work you might have to transform date/time into a linear measure, e.g. time in seconds since market opening.)
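For example, that conversion might be done on data before the subset() calls above (date and clock are placeholder names for the first two comma-separated fields in the sample):
data$time <- as.numeric(as.POSIXct(paste(data$date, data$clock),
                                   format = "%Y/%m/%d %H:%M:%OS"))   # seconds (with fractions) on a linear scale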
Then it should be easy:
min.spread <- min(c(best.bid$ask - best.bid$price,
best.ask$bid - best.ask$price))
I'm not sure I understand the end of day particularity but I bet you could just compute the spread at market close and add it to the final min call.
For the weighted average prices, use the same idea but instead of the two best.bid and best.ask data.frames, you should start with two weighted.avg.bid and weighted.avg.ask data.frames.
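A sketch of that second route, assuming a size column as mentioned in the question:
ask <- subset(data, side == 1)
bid <- subset(data, side == 0)
weighted.avg.ask <- aggregate(cbind(psize = price * size, size) ~ time, ask, sum)
weighted.avg.ask$wprice <- weighted.avg.ask$psize / weighted.avg.ask$size   # size-weighted average ask
weighted.avg.bid <- aggregate(cbind(psize = price * size, size) ~ time, bid, sum)
weighted.avg.bid$wprice <- weighted.avg.bid$psize / weighted.avg.bid$size   # size-weighted average bid
# then pair bids and asks over time with findInterval(), as for best.bid / best.ask above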
