Compile all data produced by rolling regression into one - r

I am doing a rolling regression on a huge database. The reference column used for rolling is called "Q", with values from 5 to 45 for each data block. At first I tried simple code step by step, and it works well:
fit <- as.formula(EB~EB1+EB2+EB3+EB4)
#use the 20 Quarters data to do regression
model<-lm(fit,data=datapool[(which(datapool$Q>=5&datapool$Q<=24)),])
#use the model to forecast the value of next quarter
pre<-predict(model,newdata=datapool[which(datapool$Q==25),])
#get the forecast error
error<-datapool[which(datapool$Q==25),]$EB -pre
The result of the code above is:
> head(t(t(error)))
[,1]
21 0.006202145
62 -0.003005097
103 -0.019273856
144 -0.016053012
185 -0.025608022
226 -0.004548264
The datapool has the structure below:
> head(datapool)
X Q Firm EB EB1 EB2 EB3
1 1 5 CMCSA US Equity 0.02118966 0.08608825 0.01688180 0.01826571
2 2 6 CMCSA US Equity 0.02331379 0.10506550 0.02118966 0.01688180
3 3 7 CMCSA US Equity 0.01844747 0.12961955 0.02331379 0.02118966
4 4 8 CMCSA US Equity NA NA 0.01844747 0.02331379
5 5 9 CMCSA US Equity 0.01262287 0.05622834 NA 0.01844747
6 6 10 CMCSA US Equity 0.01495291 0.06059339 0.01262287 NA
...
Firm B(also from Q5 to Q45)
...
Firm C(also from Q5 to Q45)
The errors produced above are all labelled with the "X" value from "datapool", so I can tell which firm each error comes from.
Since I need to run the regression 21 times (quarters 5-24, 6-25, ..., 25-44), I do not want to do it manually, and came up with the following code:
fit <- as.formula(EB~EB1+EB2+EB3+EB4)
for (i in 0:20){
model<-lm(fit,data=datapool[(which(datapool$Q>=5+i&datapool$Q<=24+i)),])
pre<-predict(model,newdata=datapool[which(datapool$Q==25+i),])
error<-datapool[which(datapool$Q==25),]$EB -pre
}
The code above runs without any error message, but I do not know how to compile the errors produced by each regression into one data frame automatically. Can anyone help me with that?

(I say again: Really bad idea to use the name 'error' for a vector. It is the name of a core function.) This is how I would have attempted that task, using the subset parameter and logical indexing rather than the tortured which statements:
fit <- as.formula(EB~EB1+EB2+EB3+EB4)
# one list slot per rolling window (i = 0..20 maps to slots 1..21)
errlist <- vector("list", 21)
for (i in 0:20){
  # fit on the 20-quarter window, then forecast the following quarter
  model <- lm(fit, data=datapool, subset= Q>=5+i & Q<=24+i )
  nextq <- datapool[ datapool[["Q"]] == 25+i, ]
  # keep X and Q so each forecast error can be traced back to its row
  errlist[[i+1]] <- data.frame(X=nextq$X, Q=nextq$Q, err=nextq$EB - predict(model, newdata=nextq))
}
errset <- do.call(rbind, errlist)
errset
No guarantees this won't error out by running out of data at the beginning or end, since you have offered neither data nor a comprehensive description of the data object.
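Since no data were posted, here is a minimal self-contained sketch on simulated data. Everything about the simulated datapool (three firms named A, B and C, the random predictors, the coefficients) and the helper name roll_errors are assumptions made only for illustration. It fits each 20-quarter window, forecasts the next quarter, and stacks all forecast errors into one data frame.

```r
set.seed(1)

# Simulated stand-in for datapool: 3 hypothetical firms, quarters 5..45.
datapool <- expand.grid(Q = 5:45, Firm = c("A", "B", "C"),
                        stringsAsFactors = FALSE)
datapool$X <- seq_len(nrow(datapool))
for (v in c("EB1", "EB2", "EB3", "EB4")) datapool[[v]] <- rnorm(nrow(datapool))
datapool$EB <- with(datapool,
                    0.5 * EB1 + 0.2 * EB2 + rnorm(nrow(datapool), sd = 0.1))

# Hypothetical helper: fit on quarters (5+i)..(24+i), forecast quarter 25+i,
# and return the forecast errors together with their identifying columns.
roll_errors <- function(d, i) {
  train <- d[d$Q >= 5 + i & d$Q <= 24 + i, ]
  test  <- d[d$Q == 25 + i, ]
  model <- lm(EB ~ EB1 + EB2 + EB3 + EB4, data = train)
  data.frame(X = test$X, Firm = test$Firm, Q = test$Q,
             err = test$EB - predict(model, newdata = test))
}

# One data frame per window, stacked row-wise into a single result.
all_errors <- do.call(rbind, lapply(0:20, roll_errors, d = datapool))
head(all_errors)
```

Collecting the per-window results in a list and binding them once at the end avoids growing a data frame inside the loop.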

Related

Performing a 2 sample t test in R with replicates

I have a dataframe named R_alltemp in R with 6 columns: 2 groups of data with 3 replicates each. I'm trying to perform a t-test for each row between the first three values and the last three, using apply() so it can go through all the rows with one line. Here is the code I'm using so far.
R_alltemp$p.value<-apply(R_all3,1, function (x) t.test(x(R_alltemp[,1:3]), x(R_alltemp[,4:6]))$p.value)
and here is a snapshot of the table
R1.HCC827 R2.HCC827 R3.HCC827 R1.nci.h1975 R2.nci.h1975 R3.nci.h1975 p.value
1 13.587632 22.225083 15.074230 58.187465 79 82.287573 0.4391160
2 2.717526 1.778007 1.773439 1.763257 2 1.679338 0.4186339
3 203.814478 191.135711 232.320487 253.908939 263 263.656100 0.4904493
4 44.386264 45.339169 54.089884 3.526513 3 5.877684 0.3095634
It functions, but the p-values I'm getting seem wrong just from eyeballing them. For instance, in the first row the average of the first group is way lower than that of the second group, but my p-value is only 0.4.
I feel like I'm missing something very obvious here, but I've been struggling with it for much longer than I'd like. Any help would be appreciated.
Your code is incorrect. I actually don't understand why it does not return an error. This part in particular: x(R_alltemp[,1:3]) should be x[1:3].
This should be your code:
R_alltemp$p.value2 <- apply(R_alltemp, 1, function(x) t.test(x[1:3], x[4:6])$p.value)
R1.HCC827 R2.HCC827 R3.HCC827 R1.nci.h1975 R2.nci.h1975 R3.nci.h1975 p.value p.value2
1 13.587632 22.225083 15.074230 58.187465 79 82.287573 0.4391160 0.010595829
2 2.717526 1.778007 1.773439 1.763257 2 1.679338 0.4186339 0.477533387
3 203.814478 191.135711 232.320487 253.908939 263 263.656100 0.4904493 0.044883436
4 44.386264 45.339169 54.089884 3.526513 3 5.877684 0.3095634 0.002853154
Remember that by specifying 1 you are telling apply to work row by row. So inside function(x), x is the equivalent of this: x <- c(13.587632, 22.225083, 15.074230, 58.187465, 79, 82.287573), which means you want to subset the first three values with x[1:3] and the last three with x[4:6], and apply t.test to them.
A good idea before using apply is to test the function manually so if you do get odd results like these you know something went wrong with your code.
So the two-tailed p-value for the first row should be:
> g1 <- c(13.587632, 22.225083, 15.074230)
> g2 <- c(58.187465, 79, 82.287573)
> t.test(g1,g2)$p.value
[1] 0.01059583
Applying the function across all rows (I tacked the new p-value on at the end as pval):
> tt$pval <- apply(tt,1,function(x) t.test(x[1:3],x[4:6])$p.value)
> tt
R1.HCC827 R2.HCC827 R3.HCC827 R1.nci.h1975 R2.nci.h1975 R3.nci.h1975 p.value pval
1 13.587632 22.225083 15.074230 58.187465 79 82.287573 0.4391160 0.010595829
2 2.717526 1.778007 1.773439 1.763257 2 1.679338 0.4186339 0.477533387
3 203.814478 191.135711 232.320487 253.908939 263 263.656100 0.4904493 0.044883436
4 44.386264 45.339169 54.089884 3.526513 3 5.877684 0.3095634 0.002853154
Maybe it's the double-use of the data frame name in the function (that you don't need)?
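For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the four rows from the question and runs the row-wise test. The data frame name tt follows the answer above; the helper name row_pvalue is made up for illustration.

```r
# Replicate measurements copied from the table in the question.
tt <- data.frame(
  R1.HCC827    = c(13.587632, 2.717526, 203.814478, 44.386264),
  R2.HCC827    = c(22.225083, 1.778007, 191.135711, 45.339169),
  R3.HCC827    = c(15.074230, 1.773439, 232.320487, 54.089884),
  R1.nci.h1975 = c(58.187465, 1.763257, 253.908939, 3.526513),
  R2.nci.h1975 = c(79, 2, 263, 3),
  R3.nci.h1975 = c(82.287573, 1.679338, 263.656100, 5.877684)
)

# Hypothetical helper: t-test between the first three and last three values.
row_pvalue <- function(x) t.test(x[1:3], x[4:6])$p.value

# Check a single row by hand first ...
p_row1 <- row_pvalue(unlist(tt[1, ]))

# ... then apply the same function to every row (MARGIN = 1 means rows).
# Restricting to columns 1:6 keeps any later-added columns out of the test.
tt$pval <- apply(tt[, 1:6], 1, row_pvalue)
tt
```

Note that t.test defaults to the Welch (unequal variance) test; pass var.equal = TRUE if a pooled-variance test is wanted.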

RFM analysis - using ddply in R. Missing column

I am trying to use the code for RFM modelling in R from the blog here. However, how to group the data frame into “Buy” and “No Buy” is not explained clearly. As a result, when I try to execute the function getPercentages, I get an error like:
object "Buy" not found.
I am trying to add a Buy column as follows:
df$Buy <- ifelse(df$Frequency > 1, 1, 0)
before executing the function.
I do not know if this is the right way to get the values.
My head for df after getDataframe is
ID Date Amount Recency Frequency Monetary
1207779 2016-06-22 2112.00 8 20 1576.7725
2455590 2016-06-26 1064.00 4 16 1074.8400
2660337 2016-06-21 1870.00 9 20 1616.1700
257997 2016-06-22 616.00 8 22 684.8968
963883 2016-06-27 703.12 3 16 626.1125
1124489 2016-06-21 594.15 9 18 752.2011
Try this :
Buy<-rep(0,nrow(dftry))
dftry<-cbind(dftry,Buy)
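As a rough sketch of the two ways of adding the column, here is the sample from the question with Buy created both ways. The rule Frequency > 1 is the asker's own assumption, not something taken from the blog, and the column names Buy_rule and Buy_zero are invented here only to show both results side by side.

```r
# Sample rows copied from the question (Date and Amount omitted for brevity).
df <- data.frame(
  ID        = c(1207779, 2455590, 2660337, 257997, 963883, 1124489),
  Recency   = c(8, 4, 9, 8, 3, 9),
  Frequency = c(20, 16, 20, 22, 16, 18),
  Monetary  = c(1576.7725, 1074.8400, 1616.1700, 684.8968, 626.1125, 752.2011)
)

# The asker's rule: 1 if the customer bought more than once, else 0.
df$Buy_rule <- ifelse(df$Frequency > 1, 1, 0)

# The answer's approach: a column of zeros to be filled in later.
df$Buy_zero <- rep(0, nrow(df))

# Count how many rows fall in each group under the asker's rule.
table(df$Buy_rule)
```

On these six rows every Frequency is above 1, so the asker's rule labels all of them 1 and leaves no "No Buy" group, while the all-zero column leaves no "Buy" group. Either way the column exists, so the "object not found" error should go away, but whether the resulting percentages are meaningful depends on how Buy is supposed to be defined.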

R/Plotly: Error in list2env(data) : first argument must be a named list

I'm moderately experienced using R, but I'm just starting to learn to write functions to automate tasks. I'm currently working on a project to run sentiment analysis and topic models of speeches from the five remaining presidential candidates and have run into a snag.
I wrote a function to do a sentence-by-sentence analysis of positive and negative sentiments, giving each sentence a score. Miraculously, it worked and gave me a dataframe with scores for each sentence.
score text
1 1 iowa, thank you.
2 2 thanks to all of you here tonight for your patriotism, for your love of country and for doing what too few americans today are doing.
3 0 you are not standing on the sidelines complaining.
4 1 you are not turning your backs on the political process.
5 2 you are standing up and fighting back.
So what I'm trying to do now is create a function that takes the scores and figures out what percentage of the total is represented by the count of each score and then plot it using plotly. So here is the function I've written:
scoreFun <- function(x){{
tbl <- table(x)
res <- cbind(tbl,round(prop.table(tbl)*100,2))
colnames(res) <- c('Score', 'Count','Percentage')
return(res)
}
percent = data.frame(Score=rownames, Count=Count, Percentage=Percentage)
return(percent)
}
Which returns this:
saPct <- scoreFun(sanders.scores$score)
saPct
Count Percentage
-6 1 0.44
-5 1 0.44
-4 6 2.64
-3 13 5.73
-2 20 8.81
-1 42 18.50
0 72 31.72
1 34 14.98
2 18 7.93
3 9 3.96
4 6 2.64
5 2 0.88
6 1 0.44
9 1 0.44
11 1 0.44
What I had hoped it would return is a dataframe with what has ended up being the rownames as a variable called Score and the next two columns called Count and Percentage, respectively. Then I want to plot the Score on the x-axis and Percentage on the y-axis using this code:
d <- subplot(
plot_ly(clPct, x = rownames, y=Percentage, xaxis="x1", yaxis="y1"),
plot_ly(saPct, x = rownames, y=Percentage, xaxis="x2", yaxis="y2"),
margin = 0.05,
nrows=2
) %>% layout(d, xaxis=list(title="", range=c(-15, 15)),
xaxis2=list(title="Score", range=c(-15,15)),
yaxis=list(title="Clinton", range=c(0,50)),
yaxis2=list(title="Sanders", range=c(0,50)),showlegend = FALSE)
d
I'm pretty certain I've made some obvious mistakes in my function and my plot_ly code, because clearly it's not returning the dataframe I want and is leading to the error "Error in list2env(data) : first argument must be a named list" when I run the plotly code. Again, though, I'm not very experienced writing functions and I've not found a similar issue when I Google, so I don't know how to fix this.
Any advice would be most welcome. Thanks!
@MLavoie, this code from the question I referenced in my comment did the trick. Many thanks!
scoreFun <- function(x){
tbl <- data.frame(table(x))
colnames(tbl) <- c("Score", "Count")
tbl$Percentage <- tbl$Count / sum(tbl$Count) * 100
return(tbl)
}
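To illustrate what this version returns, here is a small sketch that runs it on the five scores shown in the question's sample output. The variable names scores and pct are made up for the example.

```r
scoreFun <- function(x) {
  tbl <- data.frame(table(x))            # columns: x (factor) and Freq
  colnames(tbl) <- c("Score", "Count")
  tbl$Percentage <- tbl$Count / sum(tbl$Count) * 100
  return(tbl)
}

# The five sentence scores from the sample output in the question.
scores <- c(1, 2, 0, 1, 2)
pct <- scoreFun(scores)

# table() stores the scores as factor levels; convert them back to numbers
# so they can be used on a numeric axis.
pct$Score <- as.numeric(as.character(pct$Score))
pct
```

The result is an ordinary data frame with named columns, which is what the plotting call needs as its data argument. In current plotly versions the columns are referenced with a formula, for example plot_ly(pct, x = ~Score, y = ~Percentage).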

Regression Loop by category

I have a data set that has multiple engines and I want to create a for loop function to run a linear regression for each engine and to extract the coefficients of each regression. So I want the first regression to run across the six weeks with engine= Google and then the second one to run across the six weeks with engine =Bing, etc. A sample of the data set looks like this:
Engine Wk Imp Clicks lnSpend Actions CPA
google 1 100302 15791 10998 31 354.79
google 2 23893 4734 2866 16 179.18
google 3 318 16 37.83 11 3.44
google 4 7992 1980 1704.81 27 63.14
google 5 13206 3292 2732.13 26 105.08
google 6 10888 2966 2293.86 22 104.27
bing 1 23536 1808 1028.95 3 342.98
bing 2 86873 7196 2740.28 14 195.73
bing 3 54654 4398 1786.96 13 137.46
bing 4 45553 3353 1860.47 13 143.11
bing 5 41254 3322 1811.80 13 139.37
bing 6 38305 3117 1501.01 19 79.00
The regression equation is actions~ spend and this would remain constant across all the engines.
This is the code that I have so far:
for(i in unique(mydata$engine))
{
reg<- append(reg, lm(mydata$Actions~ mydata$lnspend, data=mydata[mydata$engine== i,]))
}
summary(reg)
However, when I do that, the regression runs on the full data set combining all of the engines together.
I also tried using a by function. The code that I have for that is
reg<- by(mydata$engine, function(mydata) lm(Actions~ lnspend, data=mydata))
sapply(reg, coef)
When I run that I get the following error:
"Error in unique.default(x, nmax = nmax)"
Any idea how to fix it?
Using data.table:
library(data.table)
mydata=data.table(mydata)
mydata[,as.list(lm(Actions~lnSpend)$coeff),by=Engine]
Engine (Intercept) lnSpend
1: google 17.632263 0.001318611
2: bing 4.735979 0.004341699
You can also make it work by adjusting your first for loop a bit:
Engine <- c('g','g','g','g','g','g','b','b','b','b','b','b')
Actions <- c(31,16,11,27,26,22,3,14,13,13,13,19)
lnSpend <- c(10998,2866,37.83,1704.81,2732,2293,1028,2740,1786,1860,1811,1501)
df <- data.frame(Engine,Actions,lnSpend)
reg <- c()
for (eng in unique(Engine)){
m <- lm(Actions~ lnSpend, data = df[which(df$Engine == eng),])
reg <- append(reg, m$coeff)
}
reg
# > reg
# (Intercept) lnSpend (Intercept) lnSpend
# 17.632629162 0.001318568 4.734059476 0.004344177
Your for loop should work. You just did not filter the dependent and independent variables by engine: because the formula refers to mydata$Actions and mydata$lnSpend directly, lm ignores the data argument and uses the full dataset. (The column names below follow the sample data, Engine and lnSpend.)
Consider either explicitly applying the subset filter to each variable:
reg <- list()
for(i in unique(mydata$Engine))
{
  # store each fitted model under its engine name
  reg[[i]] <- lm(mydata$Actions[mydata$Engine == i] ~ mydata$lnSpend[mydata$Engine == i])
}
Or use bare column names so the data argument dictates which rows are used:
reg <- list()
for(i in unique(mydata$Engine))
{
  reg[[i]] <- lm(Actions ~ lnSpend,
                 data=mydata[mydata$Engine == i,])
}
# one column of coefficients per engine
sapply(reg, coef)
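A loop-free variant in base R, sketched on the sample rows from the question (only the columns the regression needs are rebuilt here, and the names fits and coefs are made up for illustration):

```r
# Sample data copied from the question.
mydata <- data.frame(
  Engine  = rep(c("google", "bing"), each = 6),
  Wk      = rep(1:6, times = 2),
  lnSpend = c(10998, 2866, 37.83, 1704.81, 2732.13, 2293.86,
              1028.95, 2740.28, 1786.96, 1860.47, 1811.80, 1501.01),
  Actions = c(31, 16, 11, 27, 26, 22, 3, 14, 13, 13, 13, 19)
)

# Split the rows by engine and fit the same model to each piece.
fits <- lapply(split(mydata, mydata$Engine),
               function(d) lm(Actions ~ lnSpend, data = d))

# One row of coefficients per engine.
coefs <- t(sapply(fits, coef))
coefs
```

Keeping the fitted models in a named list means summary(fits$google) and similar calls still work afterwards, which is not the case once only the coefficients are stored.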

Add a column to a database with matching values from another database r

Sorry that my question is a little vague. I have two separate data sets (data1 as the first and data2 as the second) as follows:
Area Yr AllRev Totalcalls
A 2012 1021597.78 835
B 2013 1002968.21 833
c 2014 730345.93 65
d 2015 251956.26 232
e 2012 22408.71 25
...
Data 2:
Yr TotRev TotCalls
2012 160038596.0 131064
2013 399750664.0 312651
...
Now I want to add a column "RevPercent" to data 1 which is going to calculate the following value for each row:
100*data1$AllRev/data2$TotRev
However, the divisor has to come from the matching year: if Yr == 2012 in data1, I want it to read TotRev for 2012 from data2 to calculate the aforementioned value. I wrote the following line of code, but it gives me a warning and the wrong result:
data1 <- cbind(data1,100*round(data1[,3]/data2[data2[,1]==data2[,2],2],4))
And the error is as follows:
In data2[, 1] == data2[,2] :
longer object length is not a multiple of shorter object length
Any help is appreciated.
Thanks
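One possible approach, sketched on the rows shown above: look up each row's year in data2 with match() and divide by the matching TotRev. Only the rows visible in the question are rebuilt here, so the years 2014 and 2015, which fall in the elided part of data2, come out as NA in this sketch. The name idx is made up for the example.

```r
# Rows copied from the question.
data1 <- data.frame(
  Area       = c("A", "B", "c", "d", "e"),
  Yr         = c(2012, 2013, 2014, 2015, 2012),
  AllRev     = c(1021597.78, 1002968.21, 730345.93, 251956.26, 22408.71),
  Totalcalls = c(835, 833, 65, 232, 25)
)
data2 <- data.frame(
  Yr       = c(2012, 2013),
  TotRev   = c(160038596.0, 399750664.0),
  TotCalls = c(131064, 312651)
)

# Position of each data1 year within data2 (NA when the year is absent).
idx <- match(data1$Yr, data2$Yr)

# Divide each row's revenue by the total revenue of its own year.
data1$RevPercent <- 100 * data1$AllRev / data2$TotRev[idx]
data1
```

merge(data1, data2, by = "Yr") followed by the same division would give the same numbers, but it reorders the rows and drops unmatched years unless all.x = TRUE is set.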
