From Stata to R: recoding bysort and xtreg

I'm very new to R and currently working on a replication project for a meta-research course at my university. The paper examines whether having an in-home display to monitor energy consumption reduces energy usage. I have already recoded 300 lines of code, but now I have run into a problem I could not yet solve.
The source code says: bysort id expdays: egen ave15 = mean(power) if hours0105==1
I do understand what this does, but I cannot replicate it in R. id is the identifier for the examined household and expdays denotes the current day of the experiment. So ave15 is the average power consumption from midnight to 6 am, computed for every household on each day. I figured out that (EIPbasedata is the complete dataset containing hourly data)
EIPbasedata$ave15[EIPbasedata$hours0105 == 1] <- ave(EIPbasedata$power, EIPbasedata$ID, EIPbasedata$ExpDays, FUN=mean)
would probably do the job, but this gives me a warning:
number of items to replace is not a multiple of replacement length
and the results are not right either. I do not have any idea what I could do to solve this.
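One way this could be fixed (a sketch of my own, not from the original post, using the ID, ExpDays, power and hours0105 columns named above): subset both sides of the assignment so the replacement vector has the same length as the slots being filled, and take the group means only over the hours0105 == 1 rows, which mirrors Stata's if clause.
# sketch: restrict both sides to hours0105 == 1 so ave() returns
# exactly one value per row being replaced
idx <- EIPbasedata$hours0105 == 1
EIPbasedata$ave15 <- NA
EIPbasedata$ave15[idx] <- ave(EIPbasedata$power[idx],
                              EIPbasedata$ID[idx],
                              EIPbasedata$ExpDays[idx],
                              FUN = mean)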
The next thing I struggle to recode is:
xtreg ln_power0105 ihd0105 i.days0105 if exptime==4, fe vce(bootstrap, rep(200) seed(12345))
I think the right way would be to use plm, but I'm not sure how to implement the if condition (days0105 is a running variable for the number of the day in the experiment and 0 if not between 0 and 6 am, ihd0105 is a dummy for having an in-home display, and exptime denotes 4 am in the morning; however, I do not understand what exptime does here)
table4_1 <- plm(EIPbasedata$ln_power0105 ~ EIPbasedata$ihd0105, data=EIPbasedata, index = c("days0105"), model="within")
How do I compute the bootstrapped standard errors in plm?
I hope some expert can help me, since my R and Stata knowledge is not sufficient for this.

My lecturer provided the answer: first, I specify a subsample, which I call tmp_data here:
tmp_data <- EIPbasedata[which(EIPbasedata$ExpTime == 4), ]
Then I regress on the tmp_data, with as.factor(days0105) as the R equivalent of i.days0105:
tmp_results <- plm(tmp_data$ln_power0105 ~ tmp_data$ihd0105 + as.factor(tmp_data$days0105), data = tmp_data, index = "ID", model = "within")
There are probably better and cleaner ways to do this, but I'm fine with it for now.
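For what it's worth, a cleaner version (my addition, under the same variable names) keeps the columns inside the formula and lets data and index handle the panel structure. The household-level bootstrap afterwards is only a sketch of one way to approximate Stata's vce(bootstrap, rep(200)), since plm itself does not report bootstrapped standard errors; the R seed will not reproduce Stata's draws.
library(plm)

tmp_results <- plm(ln_power0105 ~ ihd0105 + factor(days0105),
                   data = tmp_data, index = "ID", model = "within")
summary(tmp_results)

# sketch of a household-level (cluster) bootstrap for the ihd0105 coefficient
set.seed(12345)
ids <- unique(tmp_data$ID)
boot_coefs <- replicate(200, {
  sampled <- sample(ids, length(ids), replace = TRUE)
  # duplicate the resampled households and relabel them as distinct panel units
  boot_df <- do.call(rbind, lapply(seq_along(sampled), function(i) {
    d <- tmp_data[tmp_data$ID == sampled[i], ]
    d$ID <- i
    d
  }))
  coef(plm(ln_power0105 ~ ihd0105 + factor(days0105),
           data = boot_df, index = "ID", model = "within"))["ihd0105"]
})
sd(boot_coefs)  # bootstrapped standard error of ihd0105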

Related

PLM in r, variable standardization and lag

My goal is to understand whether clicks (say, on a website) or calls (say, over the phone to new contacts) have a greater impact on new sign-ups over a 90-week period in the USA. I thought this problem lent itself well to PLM.
My data consist of 210 DMA regions (DMA), 90 weeks (WEEK), new sign-ups (NEW) (outcome) and two predictors (CALLS and CLICKS), so for example:
plm_model <- plm(NEW ~ CALLS + CLICKS, data=df, index=c("DMA", "WEEK"), model="within")
First, does one standardize the predictors and outcome across all panels (so in total) or within each panel (as in, standardizing uniquely within each DMA)? The models differ when I do both, as they should, but I can't quite find any documentation on which is "more correct" or why one would do one or the other.
Second, when I look at the data file across all DMA regions but by time, there is a lag of 4 periods on CLICKS x NEW but 20 periods on CALLS x NEW. I have arrived at those using:
which.max(sapply(1:50, function(i) cor(df$NEW, lag(df$CLICKS, i), use = "complete")))
which.max(sapply(1:50, function(i) cor(df$NEW, lag(df$CALLS, i), use = "complete")))
This does make sense for the data-- there is more "NEW" sooner after "CLICKS" but "CALLS" take more time to pay off.
I thought the logical next step would be to lag according to the correlation and then standardize based on the data available after lagging (i.e., fewer observations now), and then I was stuck on whether to standardize within or across panels (as mentioned above), and then I was going to do something like:
plm_model <- plm(NEW ~ CALLS_20_standardized + CLICKS_4_standardized, data=df, index=c("DMA", "WEEK"), model="within")
However, I am not sure whether that is the correct way to finish this exercise off. Any insight here would be appreciated.
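In case it helps frame the question, here is a rough sketch (my addition, keeping the column names from the post) of building the lagged, within-panel standardized predictors with dplyr before passing them to plm; dropping the group_by() step would instead standardize across all panels.
library(dplyr)
library(plm)

# standardize, ignoring the NAs created by the lags
std <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

df_std <- df %>%
  group_by(DMA) %>%
  arrange(WEEK, .by_group = TRUE) %>%
  mutate(CALLS_20_standardized = std(dplyr::lag(CALLS, 20)),
         CLICKS_4_standardized = std(dplyr::lag(CLICKS, 4))) %>%
  ungroup()

# rows lost to the lags are dropped by plm's default NA handling
plm_model <- plm(NEW ~ CALLS_20_standardized + CLICKS_4_standardized,
                 data = df_std, index = c("DMA", "WEEK"), model = "within")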

"grouping factor must have exactly 2 levels"

Hi y'all, I'm fairly new to R and I'm supposed to calculate an F statistic for this table.
The code I have inputted is as follows:
# F-test
res.ftest <- var.test(TotalLength ~ SwimSpeed , data = my_data)
res.ftest
I know I have more than two levels from the other posts I have read online, but I am not sure what to change to get the outcome I want.
FIRST AND FOREMOST...If you invoke
?var.test()
you will note that the S3 version you called assumes lhs is numeric and rhs is a 2-level factor.
As for the rest, while I don't know the exact wording of your work/school assignment here, the words shouldn't be "calculate an F-test", exactly. They should be "analyze these data appropriately". While there are a number of routes you could take, this is normally seen as a regression problem, NOT a problem of comparing two variances, which is what var.test() is designed to do. (Reading the documentation at, for example, https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/var.test should make this clear and is something you should always do when invoking R procedures.)
Using a subset of your data (please do this yourself for stack helpers next time rather than make someone here do it for you)...
df <- data.frame(
  ID = 1:4,
  TL = c(27.1, 29.0, 33.0, 29.3),
  SS = c(86.6, 62.4, 63.8, 62.3)
)
cor.test(df$TL,df$SS) # reports t statistic
# or
summary(lm(df$TL ~ df$SS)) # reports F statistic
Note that F is simply t^2 here in the 2 variable case.
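As a quick numerical check of that last point (my addition, reusing df from above):
ct  <- cor.test(df$TL, df$SS)           # reports t
fit <- summary(lm(TL ~ SS, data = df))  # reports F
c(t_squared = unname(ct$statistic)^2,
  F_value   = unname(fit$fstatistic["value"]))  # the two should match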
Lastly, I should add that it is remotely, vaguely possible the assignment is to check whether the variances of the 2 distributions are equal, even though I can see no reason why anyone would want to know that, considering they are 2 different measures on two different underlying scales measuring 2 different things. However,
var.test(df$TL, df$SS)
will return a "result" should you take the assignment to mean compare the observed variances.

In Surv(start_time, end_time, new_death) : Stop time must be > start time, NA created

I am using the package "survival" to fit a Cox model with time intervals (intervals are 30 days long). I am reading the data in from an xlsx worksheet. I keep getting the error that says my stop time must be greater than my start time, even though the start values are all smaller than the stop values.
I checked to make sure these are being read in as numbers, which they are. I also changed them to integers, which did not solve the problem. I used this code to see whether any observations meet this criterion:
a <- a1[which(a1$end_time > a1$start_time),]
About half the dataset meets this criterion, but when I look at the data all the start times appear to be less than the end times.
Does anyone know why this is happening and how I can fix it? I am an R newbie so perhaps there is something obvious I don't know about?
model1 <- survfit(Surv(start_time, end_time, censor) ~ exp, data = a1, weights = weight)
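One way to list the rows Surv() is actually complaining about (my addition, using the column names from the question) is to look for intervals that are missing or not strictly increasing:
bad <- with(a1, is.na(start_time) | is.na(end_time) | end_time <= start_time)
sum(bad)                              # how many offending rows
a1[bad, c("start_time", "end_time")]  # inspect them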

Optimizing dataset based on several conditions

I am trying to construct an (optimal) subset from a large dataset based on several conditions. I know that there are some possibilities to construct such a subset. See for example: this link. I tried this function, but it is unsatisfactory since it takes too long to find such a subset and might not be "intelligent" enough. Below you can find some sample data
library(data.table)

data <- data.table(id = rep(c("a","b","c","d","e","f"), 3),
                   balance = c(1000, 2000, 1500, 2000, 4000, 1500,
                               800, 2000, 1300, 1800, 2000, 500,
                               700, 1900, 1100, 1600, 500, 30),
                   rate = c(1100, 1500, 1000, 700, 300, 200,
                            400, 700, 500, 1300, 1600, 700,
                            800, 1100, 1200, 700, 400, 150),
                   grade = c(70, 100, 90, 50, 150, 40,
                             30, 80, 55, 80, 85, 20,
                             35, 70, 55, 75, 15, 10),
                   date = rep(c(2012, 2013, 2014), each = 6))
data_agg <- aggregate(cbind(rate, grade) ~ date, data = data.frame(data),sum,na.rm=T)
data_agg$ratio <- data_agg$rate / data_agg$grade
> data_agg$ratio
[1] 9.60000 14.85714 16.73077
Now the objective is (e.g.) to minimize the increase in data_agg$ratio over the years and at the same time include at least 3 IDs in this subset.
By looking at the data we see, e.g., that ID == "e" has a ratio of 300/150 = 2 in 2012, 1600/85 = 19 in 2013 and 400/15 = 27 in 2014. Since the objective is to minimize the increase over the years, deleting "e" might have a desirable effect on the subset.
datasubset <- subset(data, subset = id != "e")
data_aggsubset <- aggregate(cbind(rate, grade) ~ date, data = data.frame(datasubset),sum,na.rm=T)
data_aggsubset$ratio <- data_aggsubset$rate / data_aggsubset$grade
data_aggsubset$ratio
[1] 12.85714 13.58491 16.12245
And indeed, the ratio is more stable over the years now. Thus my question is whether there is some optimizer function which seeks IDs such that this ratio is e.g. within a bandwidth of +/- 50% of the starting value (9.6 in this example) and contains at least three IDs. My original dataset is large, thus I am looking for a more intelligent function than the one I attached in the link. Please let me know if anything is unclear. Thank you in advance!
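For the small example above, a brute-force sketch (my addition; it will not scale to the full dataset, but it makes the objective concrete) enumerates subsets of at least three IDs and keeps those whose yearly ratios all stay within +/- 50% of the full-data starting value computed above:
start_ratio <- data_agg$ratio[1]   # 9.6 in this example

ids <- unique(data$id)
candidates <- unlist(lapply(3:length(ids), function(k)
  combn(ids, k, simplify = FALSE)), recursive = FALSE)

within_band <- function(keep) {
  agg <- data[id %in% keep, .(ratio = sum(rate) / sum(grade)), by = date]
  all(abs(agg$ratio - start_ratio) <= 0.5 * start_ratio)
}

admissible <- Filter(within_band, candidates)
length(admissible)   # number of subsets meeting the bandwidth condition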

R - efficiently organize tables on condition over time

I'd like to know how to organize a data.frame into tables on conditions over time. I have a politics data set recording, over the last few decades, which position certain organizations took on a bill and whether the bill passed or failed.
I know how to organize the data individually into tables, but I do it one by one, and it's really hard to see the trends. The stackoverflow community always seems to have ingenious ways of grouping data. Here's some mock data:
Data <- data.frame(
  year = sample(1998:2004, 200, replace = TRUE),
  outcome = sample(0:1, 200, replace = TRUE),
  biz1 = sample(-2:2, 200, replace = TRUE),
  biz2 = sample(-2:2, 200, replace = TRUE),
  biz3 = sample(-2:2, 200, replace = TRUE)
)
In biz, a negative number means they oppose the outcome and a positive number means they support it. In outcome, a zero means the law did not pass, a 1 means that it did.
I would like to use tables to see how each business has become more or less successful over time, by looking at how often their positive numbers match 1s and their negative numbers match 0s, compared to every other organization (and vice versa).
A few notes
In the data set, I have about 100 businesses as columns, so I definitely need an efficient way to make the tables without naming every single column. I can select them in a range, like 125:300, since they are ordered together.
Of course I'm open to all ideas! Feel free to list any other ways of looking at this.
If I failed to ask this question right, please let me know how I could improve it.
The comments above about your question being too vague are right on target. Having said that, this interests me and the vagueness leaves me free to interpret...
First, I'd recode the outcome as -1 if the bill fails. Then outcome * bizn is in a sense a success score for that business on that legislation: positive if either a bill that the business supported passed, or a bill that the business opposed failed. Then there are several ways to visualize the scores. Here are just a few to get you started.
# re-code outcomes
Data$outcome <- ifelse(Data$outcome==0,-1,1)
library(reshape2) # for melt(...)
library(ggplot2)
gg <- melt(Data, id=c("year","outcome"),
variable.name="business", value.name="support")
gg$score <- with(gg,outcome*support) # score represents level of success
# mean success vs. year with +/- 1 sd
ggplot(gg,aes(x=year,y=score, color=business))+
stat_summary(fun.data="mean_sdl")+
stat_summary(fun.y=mean,geom="line")+
facet_grid(business~.)
# boxplot of success scores
ggplot(gg,aes(x=factor(year),y=score))+
geom_boxplot(aes(fill=business))+
facet_grid(business~.)
# barplot of success/failure frequencies
# excludes cases where a business did not take a position pro or con
gg.bar <- aggregate(score~year+business,gg,
function(eff)c(success=sum(eff>0),failure=sum(eff<0)))
gg.bar <- data.frame(gg.bar[1:2],gg.bar$score)
ggplot(gg.bar,aes(x=factor(year)))+
geom_bar(aes(y=success,fill="success"),stat="identity")+
geom_bar(aes(y=-failure,fill="failure"),stat="identity")+
geom_hline(yintercept=0,linetype=2,color="blue")+
scale_fill_discrete(name="",breaks=c("success","failure"))+
labs(x="",y="frequency")+
facet_grid(business~.)
All of these represent rather simplistic ways of looking at the data. If this was a serious project I would probably run a principal components analysis on the businesses to identify groups of businesses that tend to support or oppose the same legislation. Then I'd run a cluster analysis on the principal components to identify groups of legislation that tend to attract the support or opposition of groups of businesses.
Another way to approach this would be to run a logistic regression on the outcomes using the support/opposition of the various businesses as predictors. This would tell you which businesses tend to be more influential.
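To illustrate that last suggestion, a minimal sketch (my addition, using the three mock businesses; note that outcome was recoded to -1/1 above, so it is mapped back to 0/1 first):
# logistic regression of bill passage on business positions
Data$passed <- ifelse(Data$outcome == 1, 1, 0)
fit <- glm(passed ~ biz1 + biz2 + biz3, data = Data, family = binomial)
summary(fit)   # larger coefficients (in absolute value) suggest more influential businesses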
