Regression Loop by category - r

I have a data set with multiple engines, and I want to write a for loop that runs a linear regression for each engine and extracts the coefficients of each regression. The first regression should run across the six weeks with Engine = google, the second across the six weeks with Engine = bing, and so on. A sample of the data set looks like this:
Engine Wk Imp Clicks lnSpend Actions CPA
google 1 100302 15791 10998 31 354.79
google 2 23893 4734 2866 16 179.18
google 3 318 16 37.83 11 3.44
google 4 7992 1980 1704.81 27 63.14
google 5 13206 3292 2732.13 26 105.08
google 6 10888 2966 2293.86 22 104.27
bing 1 23536 1808 1028.95 3 342.98
bing 2 86873 7196 2740.28 14 195.73
bing 3 54654 4398 1786.96 13 137.46
bing 4 45553 3353 1860.47 13 143.11
bing 5 41254 3322 1811.80 13 139.37
bing 6 38305 3117 1501.01 19 79.00
The regression equation is Actions ~ lnSpend, and it stays the same across all the engines.
This is the code that I have so far:
for(i in unique(mydata$engine))
{
reg<- append(reg, lm(mydata$Actions~ mydata$lnspend, data=mydata[mydata$engine== i,]))
}
summary(reg)
However, when I do that, the regression runs on the full data set combining all of the engines together.
I also tried using a by function. The code that I have for that is
reg<- by(mydata$engine, function(mydata) lm(Actions~ lnspend, data=mydata))
sapply(reg, coef)
When I run that I get the following error:
"Error in unique.default(x, nmax = nmax)"
Any idea how to fix it?

Using data.table:
library(data.table)
mydata=data.table(mydata)
mydata[,as.list(lm(Actions~lnSpend)$coeff),by=Engine]
Engine (Intercept) lnSpend
1: google 17.632263 0.001318611
2: bing 4.735979 0.004341699

You can also make it work by adjusting your first for loop a bit:
Engine <- c('g','g','g','g','g','g','b','b','b','b','b','b')
Actions <- c(31,16,11,27,26,22,3,14,13,13,13,19)
lnSpend <- c(10998,2866,37.83,1704.81,2732,2293,1028,2740,1786,1860,1811,1501)
df <- data.frame(Engine,Actions,lnSpend)
reg <- c()
for (eng in unique(Engine)){
m <- lm(Actions~ lnSpend, data = df[which(df$Engine == eng),])
reg <- append(reg, m$coeff)
}
reg
# > reg
# (Intercept) lnSpend (Intercept) lnSpend
# 17.632629162 0.001318568 4.734059476 0.004344177

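Your for loop should work with two changes. First, you did not filter the dependent and independent variables by engine: because the formula refers to mydata$Actions and mydata$lnspend directly, lm() ignores the data argument and uses the full data set. Second, append() flattens each lm object into its components, so store the fits in a list instead. Note also that column names are case-sensitive (Engine and lnSpend in your sample data).
Either explicitly subset each variable in the formula:
reg <- list()
for(i in unique(mydata$Engine))
{
reg[[i]] <- lm(mydata$Actions[mydata$Engine == i] ~ mydata$lnSpend[mydata$Engine == i])
}
Or use bare column names so that the data argument determines which rows are used:
reg <- list()
for(i in unique(mydata$Engine))
{
reg[[i]] <- lm(Actions ~ lnSpend,
data = mydata[mydata$Engine == i, ])
}
# one regression summary per engine
lapply(reg, summary)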
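The by() attempt from the question can probably be rescued as well. A minimal sketch, assuming the sample data from the question (only the three columns needed are reproduced here): by() expects the data frame first, then the grouping variable, then the function.

```r
# Sample data copied from the question (Engine, lnSpend and Actions only)
mydata <- data.frame(
  Engine  = rep(c("google", "bing"), each = 6),
  lnSpend = c(10998, 2866, 37.83, 1704.81, 2732.13, 2293.86,
              1028.95, 2740.28, 1786.96, 1860.47, 1811.80, 1501.01),
  Actions = c(31, 16, 11, 27, 26, 22, 3, 14, 13, 13, 13, 19)
)

# by(data, grouping, function): each call receives the rows of one engine
fits <- by(mydata, mydata$Engine, function(d) lm(Actions ~ lnSpend, data = d))

# Collect the coefficients: one column per engine, one row per coefficient
coefs <- sapply(fits, coef)
coefs
```

The original call passed only mydata$engine as the first argument, so by() never received a grouping variable, which would explain the error.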

Related

Can we extract an output text from a function in R?

I have simulation data that it repeated 100 times. I applied a mclustBIC for each sample.
Then, I would like to access the top result of this function. However, I could not access it.
I provided an example of this function.
library(mclust)
mclustBIC(iris[,-5])
The output is:
Bayesian Information Criterion (BIC):
EII VII EEI VEI EVI VVI EEE VEE EVE VVE EEV
1 -1804.0854 -1804.0854 -1522.1202 -1522.1202 -1522.1202 -1522.1202 -829.9782 -829.9782 -829.9782 -829.9782 -829.9782
2 -1123.4117 -1012.2352 -1042.9679 -956.2823 -1007.3082 -857.5515 -688.0972 -656.3270 -657.2263 -605.1841 -644.5997
3 -878.7650 -853.8144 -813.0504 -779.1566 -797.8342 -744.6382 -632.9647 -605.3982 -666.5491 -636.4259 -644.7810
4 -893.6140 -812.6048 -827.4036 -748.4529 -837.5452 -751.0198 -646.0258 -604.8371 -705.5435 -639.7078 -699.8684
5 -782.6441 -742.6083 -741.9185 -688.3463 -766.8158 -711.4502 -604.8131 NA -723.7199 -632.2056 -652.2959
6 -715.7136 -705.7811 -693.7908 -676.1697 -774.0673 -707.2901 -609.8543 -609.5584 -661.9497 -664.8224 -664.4537
7 -731.8821 -698.5413 -713.1823 -680.7377 -813.5220 -766.6500 -632.4947 NA -699.5102 -690.6108 -709.9530
8 -725.0805 -701.4806 -691.4133 -679.4640 -740.4068 -764.1969 -639.2640 -654.8237 -700.4277 -709.9392 -735.4463
9 -694.5205 -700.0276 -696.2607 -702.0143 -767.8044 -755.8290 -653.0878 NA -729.6651 -734.2997 -758.9348
VEV EVV VVV
1 -829.9782 -829.9782 -829.9782
2 -561.7285 -658.3306 -574.0178
3 -562.5522 -656.0359 -580.8396
4 -602.0104 -725.2925 -630.6000
5 -634.2890 NA -676.6061
6 -679.5116 NA -754.7938
7 -704.7699 -809.8276 -806.9277
8 -712.8788 -831.7520 -830.6373
9 -748.8237 -882.4391 -883.6931
Top 3 models based on the BIC criterion:
VEV,2 VEV,3 VVV,2
-561.7285 -562.5522 -574.0178
I want to access the last line and extract values from it (is that possible?)
Top 3 models based on the BIC criterion:
VEV,2 VEV,3 VVV,2
-561.7285 -562.5522 -574.0178
update: using summary() will help to get to this value, but not to extract from it
I tried to solve this point using another way. I first extract only the values, such that:
res <- mclustBIC(iris[,-5])
res1 <- as.data.frame(res[,1:14])
res2 <- max(res1[[1]])
However, res2 will provide me with the maximum value for a specific model. In addition, I need to know the number of clusters (from 1 to 9). I would like to have it like this:
"EII, 9, -694.5205". ## the last line of EII.
A possible solution:
library(mclust)
m <- mclustBIC(iris[,-5])
BIC <- as.numeric(summary(m))
names(BIC) <- names(summary(m))
BIC
#> VEV,2 VEV,3 VVV,2
#> -561.7285 -562.5522 -574.0178
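If you want the model name, the number of clusters and the BIC value as separate columns, one option is to work on the BIC table as a plain matrix. The sketch below uses a small hypothetical matrix (values rounded from the output above) and a made-up helper name, top_models; applying it to a real mclustBIC result would likely need the class stripped first (for example with unclass()), which is an assumption not tested here.

```r
# Hypothetical BIC-style matrix: rows = number of clusters, columns = model names
bic <- matrix(c(-1804.09, -1123.41, -878.77,
                 -829.98,  -561.73, -562.55,
                 -829.98,  -574.02, -580.84),
              nrow = 3,
              dimnames = list(c("1", "2", "3"), c("EII", "VEV", "VVV")))

# top_models() is an illustrative helper, not part of mclust
top_models <- function(bic, k = 3) {
  # positions of the k largest values, NA entries dropped
  ord <- head(order(bic, decreasing = TRUE, na.last = NA), k)
  # convert vector positions to (row, column) pairs
  pos <- arrayInd(ord, dim(bic))
  data.frame(model    = colnames(bic)[pos[, 2]],
             clusters = as.integer(rownames(bic)[pos[, 1]]),
             BIC      = bic[ord])
}

top <- top_models(bic)
top
```

The same indexing answers the "last line of EII" request: bic[nrow(bic), "EII"] gives the value for the largest number of clusters under that model.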

Creating a loop with compare_means

I am trying to create a loop to use compare_means (ggpubr library in R) across all columns in a dataframe and then select only significant p.adjusted values, but it does not work well.
Here is some code
head(df3)
sampleID Actio Beta Gammes Traw Cluster2
gut10 10 2.2 55 13 HIGH
gut12 20 44 67 12 HIGH
gut34 5.5 3 89 33 LOW
gut26 4 45 23 4 LOW
library(ggpubr)
data<-list()
for (i in 2:length(df3)){
data<-compare_means(df3[[i]] ~ Cluster2, data=df3, paired = FALSE,p.adjust.method="bonferroni",method = "wilcox.test")
}
Error: `df3[i]` must evaluate to column positions or names, not a list
I would like to create an output to convert in dataframe with all the information contained in compare_means output
Thanks a lot
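Try this. It selects each column by name, stores every result in the list instead of overwriting it, and stacks the results into one data frame at the end:
library(ggpubr)
data <- list()
for (i in 2:(length(df3) - 1)) {
v <- names(df3)[i]
new <- df3[, c(v, "Cluster2")]
colnames(new) <- c("interest", "Cluster2")
data[[v]] <- compare_means(interest ~ Cluster2, data = new, paired = FALSE, p.adjust.method = "bonferroni", method = "wilcox.test")
# record which column this result belongs to
data[[v]]$.y. <- v
}
result <- do.call(rbind, data)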
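For comparison, here is a sketch of the same loop in base R only, swapping compare_means for wilcox.test and p.adjust so that it runs without ggpubr. The data frame is a hypothetical stand-in for df3 with more rows. One point worth checking in your own output: with a single two-group comparison per call, a per-call Bonferroni adjustment has nothing to adjust across, so the correction below is applied once over all columns.

```r
# Hypothetical stand-in for df3 (six samples per cluster, no tied values)
df3 <- data.frame(
  sampleID = paste0("gut", 1:12),
  Actio    = c(10, 20, 12, 15, 18, 11, 5.5, 4, 6, 3, 2, 7),
  Beta     = c(2.2, 44, 3, 45, 10, 30, 2.5, 40, 4, 46, 11, 31),
  Cluster2 = rep(c("HIGH", "LOW"), each = 6)
)

# Every column except the ID and the grouping variable
vars <- setdiff(names(df3), c("sampleID", "Cluster2"))

res <- do.call(rbind, lapply(vars, function(v) {
  # reformulate() builds the formula <v> ~ Cluster2 from the column name
  wt <- wilcox.test(reformulate("Cluster2", response = v), data = df3)
  data.frame(variable = v, p = wt$p.value)
}))

# Adjust across all tested columns, then keep the significant ones
res$p.adj <- p.adjust(res$p, method = "bonferroni")
signif_res <- res[res$p.adj < 0.05, ]
signif_res
```

A formula built with reformulate() should also be accepted by compare_means, which would avoid renaming columns inside the loop.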

R/Plotly: Error in list2env(data) : first argument must be a named list

I'm moderately experienced using R, but I'm just starting to learn to write functions to automate tasks. I'm currently working on a project to run sentiment analysis and topic models of speeches from the five remaining presidential candidates and have run into a snag.
I wrote a function to do a sentence-by-sentence analysis of positive and negative sentiments, giving each sentence a score. Miraculously, it worked and gave me a dataframe with scores for each sentence.
score text
1 1 iowa, thank you.
2 2 thanks to all of you here tonight for your patriotism, for your love of country and for doing what too few americans today are doing.
3 0 you are not standing on the sidelines complaining.
4 1 you are not turning your backs on the political process.
5 2 you are standing up and fighting back.
So what I'm trying to do now is create a function that takes the scores and figures out what percentage of the total is represented by the count of each score and then plot it using plotly. So here is the function I've written:
scoreFun <- function(x){{
tbl <- table(x)
res <- cbind(tbl,round(prop.table(tbl)*100,2))
colnames(res) <- c('Score', 'Count','Percentage')
return(res)
}
percent = data.frame(Score=rownames, Count=Count, Percentage=Percentage)
return(percent)
}
Which returns this:
saPct <- scoreFun(sanders.scores$score)
saPct
Count Percentage
-6 1 0.44
-5 1 0.44
-4 6 2.64
-3 13 5.73
-2 20 8.81
-1 42 18.50
0 72 31.72
1 34 14.98
2 18 7.93
3 9 3.96
4 6 2.64
5 2 0.88
6 1 0.44
9 1 0.44
11 1 0.44
What I had hoped it would return is a dataframe with what has ended up being the rownames as a variable called Score and the next two columns called Count and Percentage, respectively. Then I want to plot the Score on the x-axis and Percentage on the y-axis using this code:
d <- subplot(
plot_ly(clPct, x = rownames, y=Percentage, xaxis="x1", yaxis="y1"),
plot_ly(saPct, x = rownames, y=Percentage, xaxis="x2", yaxis="y2"),
margin = 0.05,
nrows=2
) %>% layout(d, xaxis=list(title="", range=c(-15, 15)),
xaxis2=list(title="Score", range=c(-15,15)),
yaxis=list(title="Clinton", range=c(0,50)),
yaxis2=list(title="Sanders", range=c(0,50)),showlegend = FALSE)
d
I'm pretty certain I've made some obvious mistakes in my function and my plot_ly code, because clearly it's not returning the dataframe I want, and it leads to the error "Error in list2env(data) : first argument must be a named list" when I run the plotly code. Again, though, I'm not very experienced writing functions and I've not found a similar issue when I Google, so I don't know how to fix this.
Any advice would be most welcome. Thanks!
@MLavoie, this code from the question I referenced in my comment did the trick. Many thanks!
scoreFun <- function(x){
tbl <- data.frame(table(x))
colnames(tbl) <- c("Score", "Count")
tbl$Percentage <- tbl$Count / sum(tbl$Count) * 100
return(tbl)
}
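A short usage sketch of that function, with a hypothetical score vector standing in for sanders.scores$score. One thing to watch: table() turns the scores into factor levels, so Score needs converting back to numbers if it is meant to go on a numeric axis.

```r
scoreFun <- function(x) {
  tbl <- data.frame(table(x))
  colnames(tbl) <- c("Score", "Count")
  tbl$Percentage <- tbl$Count / sum(tbl$Count) * 100
  tbl
}

# Hypothetical sentence scores
scores <- c(1, 2, 0, 1, 2, -1, 0, 0)
pct <- scoreFun(scores)

# Score comes back as a factor; convert via character to keep the real values
pct$Score <- as.numeric(as.character(pct$Score))
pct
```

With a data frame like this, the columns can be referenced by name in the plotting call. In current plotly versions that is, as far as I know, done with formulas, for example plot_ly(pct, x = ~Score, y = ~Percentage).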

Class probabilities in Neural networks

I use the caret package with a multi-layer perceptron.
My dataset consists of a labelled output value, which can be either A,B or C. The input vector consists of 4 variables.
I use the following lines of code to calculate the class probabilities for each input value:
fit <- train(device~.,data=dataframetrain[1:100,], method="mlp",
trControl=trainControl(classProbs=TRUE))
(p=(predict(fit,newdata=dataframetest,type=("prob"))))
I thought that the class probabilities for each record must sum up to one. But I get the following:
rowSums(p)
# 1 2 3 4 5 6 7 8
# 1.015291 1.015265 1.015291 1.015291 1.015291 1.014933 1.015011 1.015291
# 9 10 11 12 13 14 15 16
# 1.014933 1.015206 1.015291 1.015291 1.015291 1.015224 1.015011 1.015291
Can anybody help me because I don't know what I did wrong.
There's probably nothing wrong, it just seems that caret returns the values of the neurons in the output layer without converting them to probabilities (correct me if I'm wrong). When using the RSNNS::mlp function outside of caret the rows of the predictions also don't sum to one.
Since all output neurons have the same activation function the outputs can be converted to probabilities by dividing the predictions by the respective row sum, see this question.
This behavior seems to be true when using method = "mlp" or method = "mlpWeightDecay" but when using method = "nnet" the predictions do sum to one.
Example:
library(RSNNS)
data(iris)
#shuffle the vector
iris <- iris[sample(1:nrow(iris),length(1:nrow(iris))),1:ncol(iris)]
irisValues <- iris[,1:4]
irisTargets <- iris[,5]
irisTargetsDecoded <- decodeClassLabels(irisTargets)
iris2 <- splitForTrainingAndTest(irisValues, irisTargetsDecoded, ratio=0.15)
iris2 <- normTrainingAndTestSet(iris2)
set.seed(432)
model <- mlp(iris2$inputsTrain, iris2$targetsTrain,
size=5, learnFuncParams=c(0.1), maxit=50,
inputsTest=iris2$inputsTest, targetsTest=iris2$targetsTest)
predictions <- predict(model,iris2$inputsTest)
head(rowSums(predictions))
# 139 26 17 104 54 82
# 1.0227419 1.0770722 1.0642565 1.0764587 0.9952268 0.9988647
probs <- predictions / rowSums(predictions)
head(rowSums(probs))
# 139 26 17 104 54 82
# 1 1 1 1 1 1
# nnet example --------------------------------------
library(caret)
training <- sample(seq_along(irisTargets), size = 100, replace = F)
modelCaret <- train(y = irisTargets[training],
x = irisValues[training, ],
method = "nnet")
predictionsCaret <- predict(modelCaret,
newdata = irisValues[-training, ],
type = "prob")
head(rowSums(predictionsCaret))
# 122 100 89 134 30 86
# 1 1 1 1 1 1
I don't know how much flexibility the caret package offers in these choices, but the standard way to make a neural net produce outputs which sum to one is to use the softmax function as the activation function in the output layer.
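To make the two options concrete, here is a base R sketch on a hypothetical matrix of raw output values (the numbers are made up). It shows row normalisation as used above next to a softmax. Note that softmax is normally applied to the output layer's pre-activation values inside the network; applying it afterwards to already squashed outputs, as here, is only an illustration and gives flatter probabilities.

```r
# Hypothetical raw outputs: 3 records x 3 classes
raw <- matrix(c(0.90, 0.10, 0.03,
                0.20, 0.75, 0.08,
                0.05, 0.15, 0.85),
              nrow = 3, byrow = TRUE,
              dimnames = list(NULL, c("A", "B", "C")))

# Option 1: divide each row by its sum
norm_probs <- raw / rowSums(raw)

# Option 2: softmax per row (subtracting the row maximum for numerical stability)
softmax <- function(m) {
  e <- exp(m - apply(m, 1, max))
  e / rowSums(e)
}
soft_probs <- softmax(raw)

rowSums(norm_probs)
rowSums(soft_probs)
```

Both versions keep the ordering of the classes within a row, so the predicted class does not change; only the probability values differ.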

Compile all data produced by rolling regression into one

I am doing a rolling regression on a huge database. The reference column used for rolling is called "Q", with values from 5 to 45 for each data block. At first I tried simple code step by step, and it works well:
fit <- as.formula(EB~EB1+EB2+EB3+EB4)
#use the 20 Quarters data to do regression
model<-lm(fit,data=datapool[(which(datapool$Q>=5&datapool$Q<=24)),])
#use the model to forecast the value of next quarter
pre<-predict(model,newdata=datapool[which(datapool$Q==25),])
#get the forecast error
error<-datapool[which(datapool$Q==25),]$EB -pre
The result of the code above is:
> head(t(t(error)))
[,1]
21 0.006202145
62 -0.003005097
103 -0.019273856
144 -0.016053012
185 -0.025608022
226 -0.004548264
The datapool has the structure below:
> head(datapool)
X Q Firm EB EB1 EB2 EB3
1 1 5 CMCSA US Equity 0.02118966 0.08608825 0.01688180 0.01826571
2 2 6 CMCSA US Equity 0.02331379 0.10506550 0.02118966 0.01688180
3 3 7 CMCSA US Equity 0.01844747 0.12961955 0.02331379 0.02118966
4 4 8 CMCSA US Equity NA NA 0.01844747 0.02331379
5 5 9 CMCSA US Equity 0.01262287 0.05622834 NA 0.01844747
6 6 10 CMCSA US Equity 0.01495291 0.06059339 0.01262287 NA
...
Firm B(also from Q5 to Q45)
...
Firm C(also from Q5 to Q45)
Each error above is labelled with its "X" value from datapool, so I can tell which firm it comes from.
Since I need to run the regression 21 times (quarters 5-24, 6-25, ..., 25-44), I do not want to do it manually, and came up with the following code:
fit <- as.formula(EB~EB1+EB2+EB3+EB4)
for (i in 0:20){
model<-lm(fit,data=datapool[(which(datapool$Q>=5+i&datapool$Q<=24+i)),])
pre<-predict(model,newdata=datapool[which(datapool$Q==25+i),])
error<-datapool[which(datapool$Q==25+i),]$EB -pre
}
The code above runs without errors, but I do not know how to compile the errors produced by each regression into one data set automatically. Can anyone help me with that?
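(I say again: it is a really bad idea to use the name 'error' for a vector, because it is too easy to confuse with R's error-handling vocabulary.) This is how I would attempt the task, using the subset parameter and logical indexing rather than the tortured which() statements. Each window's forecast errors are stored in a list, together with their X, Firm and Q values, and combined at the end:
fit <- as.formula(EB~EB1+EB2+EB3+EB4)
errlist <- vector("list", 21)
for (i in 0:20){
# fit on the 20-quarter window, then forecast the following quarter
model <- lm(fit, data=datapool, subset= Q>=5+i & Q<=24+i )
nextq <- datapool[ datapool[["Q"]] == 25+i, ]
pre <- predict(model, newdata=nextq)
# i starts at 0, so shift the list index by one
errlist[[i+1]] <- data.frame(X=nextq$X, Firm=nextq$Firm, Q=nextq$Q, err=nextq$EB - pre)
}
errset <- do.call(rbind, errlist)
errset
No guarantees this won't error out by running out of data at the beginning or end, since you have not offered either data or a comprehensive description of the data object.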
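Since no data was posted, here is a sketch of the rolling scheme on simulated data, to show one way the per-window forecast errors can end up in a single data frame. Everything about the simulated datapool (three firms "A" to "C", random predictors, the coefficients) is made up for illustration; only the column names and the window logic follow the question.

```r
set.seed(42)

# Simulated stand-in for datapool: 3 hypothetical firms, quarters 5 to 45
datapool <- expand.grid(Q = 5:45, Firm = c("A", "B", "C"))
datapool$X <- seq_len(nrow(datapool))
for (v in c("EB1", "EB2", "EB3", "EB4")) datapool[[v]] <- rnorm(nrow(datapool))
datapool$EB <- 0.5 * datapool$EB1 - 0.2 * datapool$EB2 +
  rnorm(nrow(datapool), sd = 0.1)

fit <- EB ~ EB1 + EB2 + EB3 + EB4

# One data frame of forecast errors per window
roll <- lapply(0:20, function(i) {
  train <- datapool[datapool$Q >= 5 + i & datapool$Q <= 24 + i, ]
  nextq <- datapool[datapool$Q == 25 + i, ]
  model <- lm(fit, data = train)
  data.frame(X = nextq$X, Firm = nextq$Firm, Q = nextq$Q,
             fc_error = nextq$EB - predict(model, newdata = nextq))
})

# Stack all windows into one data frame
errset <- do.call(rbind, roll)
dim(errset)
```

The training rows are subset explicitly here rather than through lm()'s subset argument, because inside a function the subset expression may not find the loop variable i. Keeping X, Firm and Q next to each error means the result can be merged back onto the original data afterwards.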

Resources