Multi-factor linear regression in R

df <- data.frame(
num =
c(5, 7, 3,
4, 2, 6,
5, 3, 6,
5, 6, 0,
7, 4, 0,
7, 7, 0,
6, 6, 0,
4, 6, 1,
6, 4, 0,
7, 7, 0,
2, 4, 0,
5, 7, 4,
7, 5, 0,
4, 5, 0,
6, 6, 3
),
x1 = factor(rep(c("xx", "pp", "tru"), 15)),
x2 = factor(rep(c("A", "B", "C"), 15)),
x3 = factor(rep(1:15, rep(3, 15))))
I would like to calculate significance for:
x1
x2
x3
interaction x1/x2
interaction x1/x3
interaction x2/x3
interaction x1/x2/x3
I think I need to fit a linear model with lm, so I have tried
lm(df[,"num"] ~ df[,"x1"] * df[,"x2"] * df[,"x3"])
but I am not sure whether this is correct.

The standard approach is to fit a linear model and then perform an ANOVA:
fit <- lm(num ~ x1 * x2 * x3, data = df)
anova(fit)
However, the toy example you provided is a bad one, so nothing interesting will be seen.
You have x1 and x2 perfectly confounded (each level of x1 always occurs with the same level of x2), so you will get lots of NA coefficients.
You have no replication: each factor combination has only one observation, so the model is saturated and you will end up with an exact fit with zero residuals.
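Both problems are easy to confirm on the fit above; a minimal sketch using base R only:
alias(fit)                 # lists the coefficients aliased by the x1/x2 confounding
max(abs(residuals(fit)))   # effectively zero: the saturated model fits exactly
nlevels(interaction(df$x1, df$x2, drop = TRUE))  # 3 observed combinations, not 9: x1 and x2 move together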


How do I interpret the coefficients of a glm with binomial error distribution?

I would be happy if someone could help me understand glm with binomial error distribution.
Let's assume the following df:
year<-c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
success<-c(1, 0, 3, 1, 1, 2, 6, 0, 1, 1, 12, 2, NA, 6, 12, 0, 10,
7, 4, 10, 13, 1, 2, 1, 18, 6, 3, 8, 3, 1, 9, 15, 6, 12,
6, 15, 13, 6, 8, 6, 2, 11, 6, 1, 12, 0, 4, 15, 0, 3, 18,
5, 6, 17, 5, 3, 17, 8, 0, 7, 12, 10, 26, 12, 4, 17, 1, 8,
2, 7, 14, 8)
no_success<-c(1, 9, 5, 4, 6, 1, 4, 4, 6, 10, 16, 4, NA, 3, NA, 3,
5, 5, 6, 10, 0, 5, 3, 10, 1, 7, 11, 8, 20, 4, 3, 3,
19, 1, 11, 4, 6, 4, 9, 4, 10, 4, 2, 8, 3, 1, 13, 3,
5, 7, 5, 9, 3, 6, 3, 4, 3, 13, 6, 5, 10, 3, 1, 0,
18, 6, 13, 0, 3, 2, 2, 2)
df<-data.frame(year,success,no_success)
df$success<-as.integer(df$success)
df$no_success<-as.integer(df$no_success)
If I want to know whether there is a linear increase or decrease across years in the success or no_success of a made-up treatment, I fit a binomial glm:
m<- glm(cbind(success, no_success)~year,
data=df, family = "quasibinomial",
na.action=na.exclude)
summary(m)
I changed to "quasibinomial" here because of overdispersion.
From the summary I see that there is a significant effect: P: 0.0219 *
As the coefficients in a binomial glm represent log odds,
I get exp(estimate) = exp(0.3099) = 1.363.
So the odds of success increase by a factor of 1.363 per year.
My Questions are:
1.) When I exponentiate a negative estimate, the result is always positive. That cannot be correct; there must be a way to express negative relationships.
2.) When I want to visualize multiple linear models, I like to display the estimates.
In a "normal" lm I would display the estimate and confidence interval like this: divide the estimate by the mean of the observations, then subtract and add 1.96 times the Std. Error divided by that mean.
Estimate.mean<-exp(0.3099)/mean(df$or,na.rm=TRUE)
Std.Error.mean<-exp(0.1321)/mean(df$or,na.rm=TRUE)
low<-Estimate.mean-Std.Error.mean*1.96
high<-Estimate.mean+Std.Error.mean*1.96
If this confidence interval does not touch the zero line, the effect should be significantly different from zero.
But here the lower bound is -0.3901804 and the upper bound is 1.608095. That does not look like a significant linear relationship, despite the low p-value from the glm (0.0219).
What have I mixed up here?
I am happy for any suggestions.
The "zero line" in this case is x=1 and not x=0.
Question 2:
the question is. Is there a effect that is different from zero?
But odds of 1 basicaly means zero.
Question 1:
When the estimate is exp the result can not be negative.But odds below 1 express a negative effect.
Here are some sources on calculating the confidence interval, for anyone stumbling over this post.
https://fromthebottomoftheheap.net/2018/12/10/confidence-intervals-for-glms/
https://stats.stackexchange.com/questions/304833/how-to-calculate-odds-ratio-and-95-confidence-interval-for-logistic-regression
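To make the two points concrete, here is a minimal sketch in the spirit of the links above: build the confidence interval on the link (log-odds) scale, exponentiate the endpoints, and compare against 1 rather than 0.
est <- coef(summary(m))["year", "Estimate"]    # 0.3099
se  <- coef(summary(m))["year", "Std. Error"]  # 0.1321
exp(est)                         # odds ratio per year; below 1 would mean a negative effect
exp(est + c(-1.96, 1.96) * se)   # 95% CI; the effect is "significant" if the interval excludes 1
exp(confint(m))                  # profile-based alternative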

Calculate intraclass correlation by group in R

I need some programming/statistics help.
I have a database with multiple groups (variable "group"). The members of each group rated some items (in our example-dataset the variables "var1", "var2" and "var3").
I would like to get the intraclass correlation for each group. In particular I would like to calculate the r*wg(j), ICC(1) and ICC(2).
I looked for a solution, but the icc function in R expects the raters (my team members) as columns, not as rows. I could do it by creating a subset for every group and then transposing each dataset, but I believe there is an easier solution.
Thanks to anyone who can help me with this.
group <- c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4)
var1 <- c(4, 5, 4, 2, 3, 4, 5, 3, 5, 8, 4, 3, 4, 4, 5)
var2 <- c(2, 3, 4, 2, 4, 4, 5, 6, 6, 9, 3, 3, 2, 5, 4)
var3 <- c(4, 5, 6, 2, 3, 6, 7, 6, 7, 8, 5, 6, 3, 3, 6)
df <- data.frame(group, var1, var2, var3)
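One way to automate the subset-and-transpose idea in a few lines; this is a minimal sketch assuming the irr package's icc() (which expects raters in columns), and it only covers ICC(1): r*wg(j) and ICC(2) would need additional tooling such as the multilevel package.
library(irr)
# Split the ratings by group, put raters in columns via t(),
# then compute a one-way single-rater ICC per group.
blocks <- split(df[, c("var1", "var2", "var3")], df$group)
sapply(blocks, function(b) icc(t(b), model = "oneway")$value)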

Code to analyze relationships between responses to different ranking questions on a survey

My goal is to find much simpler code, which can generalize, that shows the relationships between responses to two survey questions. In the MWE, one question asked respondents to rank eight marketing selections from 1 to 8 and the other asked them to rank nine attribute selections from 1 to 9. Higher rankings indicate the respondent favored the selection more. Here is the data frame.
structure(list(Email = c("a", "b", "c", "d", "e", "f", "g", "h",
"i"), Ads = c(2, 1, 1, 1, 1, 2, 1, 1, 1), Alumni = c(3, 2, 2,
3, 2, 3, 2, 2, 2), Articles = c(6, 4, 3, 2, 3, 4, 3, 3, 3), Referrals = c(4,
3, 4, 8, 7, 8, 8, 6, 4), Speeches = c(7, 7, 6, 7, 4, 7, 4, 5,
5), Updates = c(8, 6, 6, 5, 5, 5, 5, 7, 6), Visits = c(5, 8,
7, 6, 6, 6, 6, 4, 8), `Business Savvy` = c(10, 6, 10, 10, 4,
4, 6, 8, 9), Communication = c(4, 3, 8, 3, 3, 9, 7, 6, 7), Experience = c(7,
7, 7, 9, 2, 8, 5, 9, 5), Innovation = c(2, 1, 4, 2, 1, 2, 2,
1, 1), Nearby = c(3, 2, 2, 1, 5, 3, 3, 2, 2), Personal = c(8,
10, 6, 8, 6, 10, 4, 3, 3), Rates = c(9, 5, 9, 6, 9, 7, 10, 5,
4), `Staffing Model` = c(6, 8, 5, 5, 7, 5, 8, 7, 8), `Total Cost` = c(5,
4, 3, 7, 8, 6, 9, 4, 6)), row.names = c(NA, -9L), class = c("tbl_df",
"tbl", "data.frame"))
If numeric rankings cannot be used to calculate relationships (correlations), please correct me.
Hoping they can be used, I arrived at the following plodding code, which I hope calculates the correlation matrix of each method selection against each attribute selection.
library(psych)
dataframe2 <- psych::corr.test(dataframe[ , c(2, 9:17)])[[1]][1:10] # the first method vs all attributes
dataframe3 <- psych::corr.test(dataframe[ , c(3, 9:17)])[[1]][1:10] # the 2nd method vs all attributes and so on
dataframe4 <- psych::corr.test(dataframe[ , c(4, 9:17)])[[1]][1:10]
dataframe5 <- psych::corr.test(dataframe[ , c(5, 9:17)])[[1]][1:10]
dataframe6 <- psych::corr.test(dataframe[ , c(6, 9:17)])[[1]][1:10]
dataframe7 <- psych::corr.test(dataframe[ , c(7, 9:17)])[[1]][1:10]
dataframe8 <- psych::corr.test(dataframe[ , c(8, 9:17)])[[1]][1:10]
# create a dataframe from the rbinded rows
bind <- data.frame(rbind(dataframe2, dataframe3, dataframe4, dataframe5, dataframe6, dataframe7, dataframe8))
Rename rows and columns:
colnames(bind) <- c("Sel", colnames(dataframe[9:17]))
rownames(bind) <- colnames(dataframe[2:8])
How can I accomplish the above more efficiently?
By the way, the bind data frame also allows one to produce a heat map with the DataExplorer package.
library(DataExplorer)
DataExplorer::plot_correlation(bind)
[Summary]
In the scope of our discussion, there are two ways to get the correlation data.
Use stats::cor, i.e., cor(subset(dataframe, select = -Email))
Use psych::corr.test, i.e., corr.test(subset(dataframe, select = -Email))[[1]]
Then you may subset the correlation matrix with the desired rows and columns.
In order to use DataExplorer::plot_correlation, you can simply do plot_correlation(dataframe, type = "c"). Note: the output heatmap will include correlations for all columns, so you can just ignore the columns that are not of interest.
[Original Answer]
## Create data
dataframe <- structure(
list(
Email = c("a", "b", "c", "d", "e", "f", "g", "h", "i"),
Ads = c(2, 1, 1, 1, 1, 2, 1, 1, 1),
Alumni = c(3, 2, 2, 3, 2, 3, 2, 2, 2),
Articles = c(6, 4, 3, 2, 3, 4, 3, 3, 3),
Referrals = c(4, 3, 4, 8, 7, 8, 8, 6, 4),
Speeches = c(7, 7, 6, 7, 4, 7, 4, 5, 5),
Updates = c(8, 6, 6, 5, 5, 5, 5, 7, 6),
Visits = c(5, 8, 7, 6, 6, 6, 6, 4, 8),
`Business Savvy` = c(10, 6, 10, 10, 4, 4, 6, 8, 9),
Communication = c(4, 3, 8, 3, 3, 9, 7, 6, 7),
Experience = c(7, 7, 7, 9, 2, 8, 5, 9, 5),
Innovation = c(2, 1, 4, 2, 1, 2, 2, 1, 1),
Nearby = c(3, 2, 2, 1, 5, 3, 3, 2, 2),
Personal = c(8, 10, 6, 8, 6, 10, 4, 3, 3),
Rates = c(9, 5, 9, 6, 9, 7, 10, 5, 4),
`Staffing Model` = c(6, 8, 5, 5, 7, 5, 8, 7, 8),
`Total Cost` = c(5, 4, 3, 7, 8, 6, 9, 4, 6)
),
row.names = c(NA, -9L),
class = c("tbl_df", "tbl", "data.frame")
)
Following your example strictly, we can do the following:
## Calculate correlation
df2 <- subset(dataframe, select = -Email)
marketing_selections <- names(df2)[1:7]
attribute_selections <- names(df2)[8:16]
corr_matrix <- psych::corr.test(df2)[[1]]
bind <- subset(corr_matrix,
subset = rownames(corr_matrix) %in% marketing_selections,
select = attribute_selections)
DataExplorer::plot_correlation(bind)
WARNING
However, is this what you really want? psych::corr.test generates the correlation matrix, and DataExplorer::plot_correlation then calculates the correlation again. That is like taking the correlation of the correlation.
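If the goal is a heat map of the already-computed correlations rather than a re-correlation, one alternative sketch (assuming ggplot2 is acceptable) is to melt the matrix into long format and plot it directly:
library(ggplot2)
# One row per (method, attribute, correlation) triple
plot_df <- as.data.frame(as.table(bind))
names(plot_df) <- c("Method", "Attribute", "r")
ggplot(plot_df, aes(Attribute, Method, fill = r)) +
  geom_tile() +
  scale_fill_gradient2(limits = c(-1, 1)) +
  labs(x = NULL, y = NULL)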

Is it possible to limit forecasts made by bsts to positive values only?

I am learning to use various forecasting packages available in R, and came across bsts(). The data I deal with is a time series of demands.
library(bsts)
data <- c(27, 2, 7, 7, 9, 4, 3, 3, 3, 9, 6, 2, 6, 2, 3, 8, 6, 1, 3, 8, 4, 5, 8, 5, 4, 4, 6, 1, 6, 5, 1, 3, 0, 2, 6, 7, 1, 2, 6, 2, 8, 6, 1, 1, 3, 2, 1, 3, 1, 6, 3, 4, 3, 7, 3, 4, 1, 7, 5, 6, 3, 4, 3, 9, 2, 1, 7, 2, 2, 9, 4, 5, 3, 4, 2, 4, 4, 8, 6, 3, 9, 2, 9, 4, 1, 3, 8, 1, 7, 7, 6, 0, 1, 4, 8, 9, 2, 5)
ts.main <- ts(data, start = c(1910, 1), frequency = 12)
ss <- AddLocalLinearTrend(list(), y = ts.main)
ss <- AddSeasonal(ss, y = as.numeric(ts.main), nseasons = 12)
model <- bsts(as.numeric(ts.main),
              state.specification = ss,
              niter = 1000)
pred <- predict(model, horizon = 12)
Is there a way I can keep pred$mean from becoming negative?
Since your data are a time series of counts, you need to take that into account rather than assume Gaussian errors; for discussion and elaboration of some approaches, see for example Brandt et al. (2000) and Brandt and Williams (2001). Luckily, the bsts package has built-in functionality for this: the family option (see pages 24 to 26 of the documentation).
So you can just do this:
model <- bsts(as.numeric(ts.main),
state.specification = ss,
family = 'poisson',
niter = 1000)
so that the bsts() function correctly treats the data as counts. This solves your issue, since draws from the posterior predictive distribution are then non-negative by definition.
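As a quick sanity check (in a bsts prediction object, pred$distribution holds the posterior predictive draws):
pred <- predict(model, horizon = 12)
min(pred$distribution)  # >= 0: Poisson draws are counts
pred$mean               # forecast means, all non-negative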

R: displaying scientific notation

chocolate <- data.frame(
Sabor =
c(5, 7, 3,
4, 2, 6,
5, 3, 6,
5, 6, 0,
7, 4, 0,
7, 7, 0,
6, 6, 0,
4, 6, 1,
6, 4, 0,
7, 7, 0,
2, 4, 0,
5, 7, 4,
7, 5, 0,
4, 5, 0,
6, 6, 3
),
Tipo = factor(rep(c("A", "B", "C"), 15)),
Provador = factor(rep(1:15, rep(3, 15))))
tapply(chocolate$Sabor, chocolate$Tipo, mean)
ajuste <- lm(chocolate$Sabor ~ chocolate$Tipo + chocolate$Provador)
summary(ajuste)
anova(ajuste)
a1 <- aov(chocolate$Sabor ~ chocolate$Tipo + chocolate$Provador)
posthoc <- TukeyHSD(x=a1, 'chocolate$Tipo', conf.level=0.95)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = chocolate$Sabor ~ chocolate$Tipo + chocolate$Provador)

$`chocolate$Tipo`
           diff       lwr       upr     p adj
B-A -0.06666667 -1.803101  1.669768 0.9950379
C-A -3.80000000 -5.536435 -2.063565 0.0000260
C-B -3.73333333 -5.469768 -1.996899 0.0000337
Here is some sample code using TukeyHSD. The output is a matrix, and I want the values to be displayed in scientific notation. I've tried using scipen and setting options(digits = 20), but some values from my actual data are still so small that the p adj values display as 0.00000000000000000000.
How can I get the values to be displayed in scientific notation?
You could do this:
format(posthoc, scientific = TRUE)
If you want to change the number of digits, for instance using 3, you could do this:
format(posthoc, scientific = TRUE, digits = 3)
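For instance, to format only the adjusted p-values of the Tipo comparisons, extract that component matrix first (posthoc is a list with one matrix per term):
tipo <- posthoc$`chocolate$Tipo`
format(tipo[, "p adj"], scientific = TRUE, digits = 3)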
