Run multiple regressions grouped by column in R [duplicate] - r

I want to do a linear regression in R using the lm() function. My data is an annual time series with one field for year (22 years) and another for state (50 states). I want to fit a regression for each state so that at the end I have a vector of lm responses. I can imagine doing a for loop over the states, running the regression inside the loop, and adding the results of each regression to a vector. That does not seem very R-like, however. In SAS I would use a 'by' statement and in SQL I would use a 'group by'. What's the R way of doing this?

Since this question was asked in 2009, dplyr has been released, and it provides a very nice way to do this kind of grouping, closely resembling what SAS does.
library(dplyr)
d <- data.frame(state = rep(c('NY', 'CA'), c(10, 10)),
                year = rep(1:10, 2),
                response = c(rnorm(10), rnorm(10)))
fitted_models <- d %>% group_by(state) %>% do(model = lm(response ~ year, data = .))
# Source: local data frame [2 x 2]
# Groups: <by row>
#
# state model
# (fctr) (chr)
# 1 CA <S3:lm>
# 2 NY <S3:lm>
fitted_models$model
# [[1]]
#
# Call:
# lm(formula = response ~ year, data = .)
#
# Coefficients:
# (Intercept) year
# -0.06354 0.02677
#
#
# [[2]]
#
# Call:
# lm(formula = response ~ year, data = .)
#
# Coefficients:
# (Intercept) year
# -0.35136 0.09385
To retrieve the coefficients and the R-squared / p-values, one can use the broom package. This package provides:
three S3 generics: tidy, which summarizes a model's
statistical findings such as coefficients of a regression;
augment, which adds columns to the original data such as
predictions, residuals and cluster assignments; and glance, which
provides a one-row summary of model-level statistics.
library(broom)
fitted_models %>% tidy(model)
# Source: local data frame [4 x 6]
# Groups: state [2]
#
# state term estimate std.error statistic p.value
# (fctr) (chr) (dbl) (dbl) (dbl) (dbl)
# 1 CA (Intercept) -0.06354035 0.83863054 -0.0757668 0.9414651
# 2 CA year 0.02677048 0.13515755 0.1980687 0.8479318
# 3 NY (Intercept) -0.35135766 0.60100314 -0.5846187 0.5749166
# 4 NY year 0.09385309 0.09686043 0.9689519 0.3609470
fitted_models %>% glance(model)
# Source: local data frame [2 x 12]
# Groups: state [2]
#
# state r.squared adj.r.squared sigma statistic p.value df
# (fctr) (dbl) (dbl) (dbl) (dbl) (dbl) (int)
# 1 CA 0.004879969 -0.119510035 1.2276294 0.0392312 0.8479318 2
# 2 NY 0.105032068 -0.006838924 0.8797785 0.9388678 0.3609470 2
# Variables not shown: logLik (dbl), AIC (dbl), BIC (dbl), deviance (dbl),
# df.residual (int)
fitted_models %>% augment(model)
# Source: local data frame [20 x 10]
# Groups: state [2]
#
# state response year .fitted .se.fit .resid .hat
# (fctr) (dbl) (int) (dbl) (dbl) (dbl) (dbl)
# 1 CA 0.4547765 1 -0.036769875 0.7215439 0.4915464 0.3454545
# 2 CA 0.1217003 2 -0.009999399 0.6119518 0.1316997 0.2484848
# 3 CA -0.6153836 3 0.016771076 0.5146646 -0.6321546 0.1757576
# 4 CA -0.9978060 4 0.043541551 0.4379605 -1.0413476 0.1272727
# 5 CA 2.1385614 5 0.070312027 0.3940486 2.0682494 0.1030303
# 6 CA -0.3924598 6 0.097082502 0.3940486 -0.4895423 0.1030303
# 7 CA -0.5918738 7 0.123852977 0.4379605 -0.7157268 0.1272727
# 8 CA 0.4671346 8 0.150623453 0.5146646 0.3165112 0.1757576
# 9 CA -1.4958726 9 0.177393928 0.6119518 -1.6732666 0.2484848
# 10 CA 1.7481956 10 0.204164404 0.7215439 1.5440312 0.3454545
# 11 NY -0.6285230 1 -0.257504572 0.5170932 -0.3710185 0.3454545
# 12 NY 1.0566099 2 -0.163651479 0.4385542 1.2202614 0.2484848
# 13 NY -0.5274693 3 -0.069798386 0.3688335 -0.4576709 0.1757576
# 14 NY 0.6097983 4 0.024054706 0.3138637 0.5857436 0.1272727
# 15 NY -1.5511940 5 0.117907799 0.2823942 -1.6691018 0.1030303
# 16 NY 0.7440243 6 0.211760892 0.2823942 0.5322634 0.1030303
# 17 NY 0.1054719 7 0.305613984 0.3138637 -0.2001421 0.1272727
# 18 NY 0.7513057 8 0.399467077 0.3688335 0.3518387 0.1757576
# 19 NY -0.1271655 9 0.493320170 0.4385542 -0.6204857 0.2484848
# 20 NY 1.2154852 10 0.587173262 0.5170932 0.6283119 0.3454545
# Variables not shown: .sigma (dbl), .cooksd (dbl), .std.resid (dbl)
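In more recent versions of dplyr, do() has been superseded; the same workflow can be written with nest_by(), which keeps one row per state with a list column. A minimal sketch, assuming dplyr >= 1.1 (broom is loaded above):
# one row per state; inside the rowwise mutate, `data` is that state's sub-data-frame
fitted_models2 <- d %>%
  nest_by(state) %>%
  mutate(model = list(lm(response ~ year, data = data)))
# expand each model into its tidy coefficient table
fitted_models2 %>% reframe(tidy(model))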

Here's an approach using the plyr package:
d <- data.frame(
  state = rep(c('NY', 'CA'), 10),
  year = rep(1:10, 2),
  response = rnorm(20)
)
library(plyr)
# Break up d by state, then fit the specified model to each piece and
# return a list
models <- dlply(d, "state", function(df) lm(response ~ year, data = df))
# Apply coef to each model and return a data frame
ldply(models, coef)
# Print the summary of each model
l_ply(models, summary, .print = TRUE)
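To pull model-level statistics rather than coefficients into a data frame, the same pattern works with a small helper function; a minimal sketch using the `models` list from above:
# one row per state with its R-squared
ldply(models, function(mod) data.frame(r.squared = summary(mod)$r.squared))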

Here's one way using the lme4 package.
library(lme4)
library(lattice)  # lmList() comes from lme4; xyplot() comes from lattice
d <- data.frame(state = rep(c('NY', 'CA'), c(10, 10)),
                year = rep(1:10, 2),
                response = c(rnorm(10), rnorm(10)))
xyplot(response ~ year, groups = state, data = d, type = 'l')
fits <- lmList(response ~ year | state, data = d)
fits
#------------
Call: lmList(formula = response ~ year | state, data = d)
Coefficients:
(Intercept) year
CA -1.34420990 0.17139963
NY 0.00196176 -0.01852429
Degrees of freedom: 20 total; 16 residual
Residual standard error: 0.8201316
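The fitted lmList object supports the usual extractors, for example:
coef(fits)     # one row of (Intercept, year) per state
summary(fits)  # per-state coefficient tables with pooled standard errors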

In my opinion, a mixed linear model is a better approach for this kind of data. The code below gives the overall trend in the fixed effect. The random effects indicate how the trend for each individual state differs from the global trend. The correlation structure takes the temporal autocorrelation into account. Have a look at Pinheiro & Bates (Mixed-Effects Models in S and S-PLUS).
library(nlme)
lme(response ~ year, random = ~ year | state,
    correlation = corAR1(form = ~ year), data = d)
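A sketch of how to inspect the result (assuming the data frame d from the answers above; note that a toy data set with only two states will estimate the random effects poorly):
fit <- lme(response ~ year, random = ~ year | state,
           correlation = corAR1(form = ~ year), data = d)
fixef(fit)  # the overall (fixed-effect) intercept and trend
coef(fit)   # per-state intercept and trend (fixed effect + random deviation)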

A nice solution using data.table was posted on CrossValidated by @Zach.
I'd just add that it is also possible to obtain the coefficient of determination (r^2) for each group:
## make fake data
library(data.table)
set.seed(1)
dat <- data.table(x=runif(100), y=runif(100), grp=rep(1:2,50))
## calculate the coefficient of determination (r^2) by group
dat[, summary(lm(y ~ x))$r.squared, by = grp]
grp V1
1: 1 0.01465726
2: 2 0.02256595
as well as all the other output from summary(lm):
dat[, list(r2 = summary(lm(y ~ x))$r.squared,
           f = summary(lm(y ~ x))$fstatistic[1]), by = grp]
grp r2 f
1: 1 0.01465726 0.714014
2: 2 0.02256595 1.108173
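Both coefficients per group can be returned in one call by converting the coefficient vector to a list, so that each element becomes its own column; a minimal sketch:
# columns `(Intercept)` and `x`, one row per group
dat[, as.list(coef(lm(y ~ x))), by = grp]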

I think it's worthwhile to add the purrr::map approach to this problem.
library(tidyverse)
d <- data.frame(state = rep(c('NY', 'CA'), c(10, 10)),
                year = rep(1:10, 2),
                response = c(rnorm(10), rnorm(10)))
d %>%
  group_by(state) %>%
  nest() %>%
  mutate(model = map(data, ~ lm(response ~ year, data = .x)))
See @Paul Hiemstra's answer for further ideas on using the broom package with these results.
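For example, a minimal sketch combining the nested models with broom::tidy into a single coefficient table (broom is installed with the tidyverse but needs its own library() call):
library(broom)
d %>%
  group_by(state) %>%
  nest() %>%
  mutate(model = map(data, ~ lm(response ~ year, data = .x)),
         tidied = map(model, tidy)) %>%
  select(state, tidied) %>%
  unnest(tidied)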

I know my answer comes a bit late, but I was looking for similar functionality. It would seem the built-in function by() in R can also do the grouping easily:
?by contains the following example, which fits per group and extracts the coefficients with sapply:
require(stats)
## now suppose we want to extract the coefficients by group
tmp <- with(warpbreaks,
            by(warpbreaks, tension,
               function(x) lm(breaks ~ wool, data = x)))
sapply(tmp, coef)
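sapply() returns the models as columns; transposing gives the more familiar one-row-per-group layout:
t(sapply(tmp, coef))  # one row per tension level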

## make fake data
ngroups <- 2
group <- 1:ngroups
nobs <- 100
dta <- data.frame(group = rep(group, each = nobs),
                  y = rnorm(nobs * ngroups),
                  x = runif(nobs * ngroups))
head(dta)
#--------------------
group y x
1 1 0.6482007 0.5429575
2 1 -0.4637118 0.7052843
3 1 -0.5129840 0.7312955
4 1 -0.6612649 0.9028034
5 1 -0.5197448 0.1661308
6 1 0.4240346 0.8944253
#------------
## function to extract the results of one model
foo <- function(z) {
  ## coef and se in a data frame
  mr <- data.frame(coef(summary(lm(y ~ x, data = z))))
  ## add the row names (the predictors / independent variables) as a column
  mr$predictor <- rownames(mr)
  mr
}
## see that it works
foo(subset(dta,group==1))
#=========
Estimate Std..Error t.value Pr...t.. predictor
(Intercept) 0.2176477 0.1919140 1.134090 0.2595235 (Intercept)
x -0.3669890 0.3321875 -1.104765 0.2719666 x
#----------
## one option: use command by
res <- by(dta,dta$group,foo)
res
#=========
dta$group: 1
Estimate Std..Error t.value Pr...t.. predictor
(Intercept) 0.2176477 0.1919140 1.134090 0.2595235 (Intercept)
x -0.3669890 0.3321875 -1.104765 0.2719666 x
------------------------------------------------------------
dta$group: 2
Estimate Std..Error t.value Pr...t.. predictor
(Intercept) -0.04039422 0.1682335 -0.2401081 0.8107480 (Intercept)
x 0.06286456 0.3020321 0.2081387 0.8355526 x
## using package plyr is better
library(plyr)
res <- ddply(dta,"group",foo)
res
#----------
group Estimate Std..Error t.value Pr...t.. predictor
1 1 0.21764767 0.1919140 1.1340897 0.2595235 (Intercept)
2 1 -0.36698898 0.3321875 -1.1047647 0.2719666 x
3 2 -0.04039422 0.1682335 -0.2401081 0.8107480 (Intercept)
4 2 0.06286456 0.3020321 0.2081387 0.8355526 x

The lm() function above is a simple example. By the way, I imagine that your database has columns of the following form:
year state var1 var2 y ...
In my view, you can use the following code (passing the data explicitly rather than attach()ing it):
# data = your data frame; 'state' is the label of the states column
modell <- by(data, data$state,
             function(d) lm(y ~ I(1/var1) + I(1/var2), data = d))
lapply(modell, summary)  # one summary per state

The question seems to be about how to call regression functions with formulas which are modified inside a loop.
Here is how you can do it in R (using the diamonds dataset):
library(ggplot2)  # for the diamonds dataset
strCols <- names(diamonds)
formula <- list(); model <- list()
for (i in 1:1) {
  formula[[i]] <- paste0(strCols[7], " ~ ", strCols[7 + i])
  model[[i]] <- glm(formula[[i]], data = diamonds)
  # then you can plot the results or anything else ...
  png(filename = sprintf("diamonds_price=glm(%s).png", strCols[7 + i]))
  par(mfrow = c(2, 2))
  plot(model[[i]])
  dev.off()
}

Related

Generating repeated measures dataset

I'm looking to generate a dataset in R for a repeated measures model and I'm not sure where to start.
The outcome of interest is continuous between 0 and 100. This is for a two-arm trial (say groups "a" and "b"), with 309 participants in each arm. Each participant is assessed at baseline, then fortnightly for one year (27 total assessments). There will be loss to follow-up and withdrawals over the year (~30% after one year), and participants may miss individual assessments at random.
For now, I am assuming the standard deviation is the same at each timepoint, and for both arms (11). The mean will change over time. I'm working on the assumption each participant's score is correlated with their baseline measurement.
How can I generate this dataset? I'm intending to compare repeated measures regression methods.
I think the following fulfils your requirements. It works by taking the cumulative sum of samples from a normal distribution across the 27 fortnightly assessments and converting these onto a logistic scale between 0 and 100 (so that the maximum / minimum scores are never breached). It uses replicate to do this for 309 participants. It then simulates 30% dropouts by choosing random participants and a random week, following which their measurements are all NA. It also adds in some random missing weeks for the rest of the participants. The result is pivoted into long format to allow for easier analysis.
library(tidyverse)
set.seed(1)
# Generate correlated scores for 309 people over 27 visits
df <- setNames(cbind(data.frame(ID = 1:309, t(replicate(309, {
  x <- cumsum(rnorm(27, 0.05, 0.1))
  round(100 * exp(x) / (1 + exp(x)))
})))), c('ID', paste0('Visit_', 1:27)))
# Model dropouts at 30% rate
dropout <- sample(c(TRUE, FALSE), 309, TRUE, prob = c(0.7, 0.3))
df[cbind(which(!dropout), sample(2:28, sum(!dropout), TRUE))] <- NA
df <- as.data.frame(t(apply(df, 1, function(x) ifelse(is.na(cumsum(x)), NA, x))))
# Add random missing visits
df[cbind(sample(309, 100, TRUE), sample(2:28, 100, TRUE))] <- NA
df <- pivot_longer(df, -ID, names_to = 'Week', values_to = 'Score') %>%
  mutate(Week = 2 * (as.numeric(gsub('\\D+', '', Week)) - 1))
Our data frame now looks like this:
head(df)
#> # A tibble: 6 x 3
#> ID Week Score
#> <dbl> <dbl> <dbl>
#> 1 1 0 50
#> 2 1 2 51
#> 3 1 4 51
#> 4 1 6 56
#> 5 1 8 58
#> 6 1 10 57
And we can see the scores drift upward over time (since we set a small positive mean in our rnorm call when creating the scores).
lm(Score ~ Week, data = df)
#>
#> Call:
#> lm(formula = Score ~ Week, data = df)
#>
#> Coefficients:
#> (Intercept) Week
#> 52.2392 0.5102
We can plot and see the overall shape of the scores and their spread:
ggplot(df, aes(Week, Score, group = ID)) + geom_line(alpha = 0.1)
Created on 2023-01-31 with reprex v2.0.2

Separating categorical variables in a linear regression

I fit a linear model using two categorical variables and one numerical variable as follows:
library(dplyr)
data(iris)
iris2 <- iris %>%
  mutate(petal_type = if_else(Petal.Length > 4, "petal_long", "petal_short"),
         sepal_type = if_else(Sepal.Length > 6, "flower_long", "flower_short"))
lm(Sepal.Width ~ sepal_type * petal_type + petal_type * Petal.Width, data = iris2)
# Coefficients:
# (Intercept) sepal_typeflower_short petal_typepetal_short
# 2.48793 -0.08721 1.51070
# Petal.Width sepal_typeflower_short:petal_typepetal_short petal_typepetal_short:Petal.Width
# 0.27131 -0.28965 -1.19334
I want to separate out the two categorical variables to get an estimate of the intercept and slope for each dummy variable. I would like to create a table like this that would describe the relationship between Sepal.Width and Petal.Width:
sepal_type petal_type Intercept_estimate Slope_estimate
flower_short petal_short
flower_short petal_long
flower_long petal_short
flower_long petal_long
I can do this by hand using different contrasts, but is there an easy way?
Thanks!
library(tidyr)  # for unnest_wider()
iris2 %>%
  group_by(petal_type, sepal_type) %>%
  summarise(model = list(coef(lm(Sepal.Width ~ Petal.Width))),
            .groups = 'drop') %>%
  unnest_wider(model)
petal_type sepal_type `(Intercept)` Petal.Width
<chr> <chr> <dbl> <dbl>
1 petal_long flower_long 2.33 0.355
2 petal_long flower_short 2.86 -0.0186
3 petal_short flower_long 2.8 NA
4 petal_short flower_short 3.62 -0.922
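A sketch of the same grouping with broom, which also returns standard errors and p-values for each combination (assuming broom is installed):
library(broom)
iris2 %>%
  group_by(petal_type, sepal_type) %>%
  group_modify(~ tidy(lm(Sepal.Width ~ Petal.Width, data = .x)))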

The intercept of a categorical multiple regression in R is not the mean value?

Let's say I have 2 (categorical) variables and one continuous:
library(tidyverse)
set.seed(123)
ds <- data.frame(
  depression = rnorm(90, 10, 2),
  schooling_dummy = c(0, 1, 2),
  sex_dummy = c(0, 1)
)
When I regress depression on sex (0 or 1), the intercept is 10.0436, which is the mean for sex = 0. OK!
ds %>% group_by(sex_dummy) %>%
  summarise(formatC(mean(depression), format = "f", digits = 4))
# A tibble: 2 x 2
sex_dummy `formatC(mean(depression), format = "f", digits = 4)`
<dbl> <chr>
1 0 10.0436
2 1.00 10.1640
The same thing happens when I regress depression on schooling. The intercept value is 10.4398. The mean of schooling = 0 is the same.
ds %>% group_by(schooling_dummy) %>%
  summarise(formatC(mean(depression), format = "f", digits = 4))
# A tibble: 3 x 2
schooling_dummy `formatC(mean(depression), format = "f", digits = 4)`
<dbl> <chr>
1 0 10.4398
2 1.00 9.7122
3 2.00 10.1593
Now, when I compute a model with both variables, why is the intercept not the mean when both groups = 0? The regression intercept is 10.3796, but the mean when sex = 0 and schooling = 0 is 10.32548:
ds %>% group_by(schooling_dummy, sex_dummy) %>%
  summarise(formatC(mean(depression), format = "f", digits = 5))
# A tibble: 6 x 3
# Groups: schooling_dummy [?]
schooling_dummy sex_dummy `formatC(mean(depression), format = "f", digits = 5)`
<dbl> <dbl> <chr>
1 0 0 10.32548
2 0 1.00 10.55404
3 1.00 0 9.59305
4 1.00 1.00 9.83139
5 2.00 0 10.21218
6 2.00 1.00 10.10648
When I predict the model when both are 0:
predict(mod3, data.frame(sex_dummy=0, schooling_dummy=0))
1
10.37956
This result is related to depression (of course...), but it is still not what I was expecting, given the usual interpretation of the intercept (reference: https://www.theanalysisfactor.com/interpret-the-intercept/), which matches what a previous forum post says.
I am aware that my variables are categorical, and I am adjusting my script accordingly, as you can reproduce using the code below:
Thanks
library(tidyverse)
set.seed(123)
ds <- data.frame(
  depression = rnorm(90, 10, 2),
  schooling_dummy = c(0, 1, 2),
  sex_dummy = c(0, 1)
)
mod <- lm(data=ds, depression ~ relevel(factor(sex_dummy), ref = "0"))
summary(mod)
ds %>% group_by(sex_dummy) %>%
  summarise(formatC(mean(depression), format = "f", digits = 4))
mod2 <- lm(data = ds, depression ~ relevel(factor(schooling_dummy), ref = "0"))
summary(mod2)
ds %>% group_by(schooling_dummy) %>%
  summarise(formatC(mean(depression), format = "f", digits = 4))
mod3 <- lm(data = ds, depression ~ relevel(factor(sex_dummy), ref = "0") +
             relevel(factor(schooling_dummy), ref = "0"))
summary(mod3)
ds %>% group_by(schooling_dummy, sex_dummy) %>%
  summarise(formatC(mean(depression), format = "f", digits = 5))
predict(mod3, data.frame(sex_dummy=0, schooling_dummy=0))
There are two errors in your thinking (although your R code works, so it's not a programming error).
First and foremost, you violated your own statement: you have not dummy coded schooling, since it does not contain only zeroes and ones; it takes the values 0, 1, and 2.
Second, you forgot the interaction effect in your lm model...
Try this...
library(tidyverse)
set.seed(123)
ds <- data.frame(
  depression = rnorm(90, 10, 2),
  schooling_dummy = c(0, 1, 2),
  sex_dummy = c(0, 1)
)
# if you explicitly make these variables factors, not integers, R will do the right thing with them
ds$schooling_dummy <- factor(ds$schooling_dummy)
ds$sex_dummy <- factor(ds$sex_dummy)
ds %>% group_by(schooling_dummy, sex_dummy) %>%
  summarise(formatC(mean(depression), format = "f", digits = 5))
# you need an asterisk in your lm model to include the interaction term
lm(depression ~ schooling_dummy * sex_dummy, data = ds)
The results give you the mean(s) you were expecting...
Call:
lm(formula = depression ~ schooling_dummy * sex_dummy, data = ds)
Coefficients:
(Intercept) schooling_dummy1 schooling_dummy2
10.325482 -0.732433 -0.113305
sex_dummy1 schooling_dummy1:sex_dummy1 schooling_dummy2:sex_dummy1
0.228561 0.009778 -0.334254
And FWIW, you can avoid this sort of accidental misuse of categorical variables if your data is coded as characters to begin with. So if your data is coded this way:
ds <- data.frame(
  depression = rnorm(90, 10, 2),
  schooling = c("A", "B", "C"),
  sex = c("Male", "Female")
)
You're less likely to make the same mistake, plus the results are easier to read...

Regression in R with Groups [duplicate]

This question already has answers here:
Linear Regression and group by in R
(10 answers)
Closed 6 years ago.
I have imported a CSV with 3 columns: 2 columns for Y and X, and a third column which identifies the category for X (I have 20 groups/categories). I am able to run a regression at the overall level, but I want to run the regression for the 20 categories separately and store the coefficients.
I tried the following :
list2env(split(sample, sample$CATEGORY_DESC), envir = .GlobalEnv)
Now I have 20 data frames; how do I run a regression on each of them and store the coefficients somewhere?
Since no data was provided, I am generating some sample data to show how you can run multiple regressions and store output using dplyr and broom packages.
In the following, there are 20 groups and different x/y values per group. 20 regressions are run and output of these regressions is provided as a data frame:
library(dplyr)
library(broom)
df <- data.frame(group = rep(1:20, 10),
                 x = rep(1:20, 10) + rnorm(200),
                 y = rep(1:20, 10) + rnorm(200))
df %>% group_by(group) %>% do(tidy(lm(x ~ y, data = .)))
Sample output:
Source: local data frame [40 x 6]
Groups: group [20]
group term estimate std.error statistic p.value
<int> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 (Intercept) 0.42679228 1.0110422 0.4221310 0.684045203
2 1 y 0.45625124 0.7913256 0.5765657 0.580089051
3 2 (Intercept) 1.99367392 0.4731639 4.2134955 0.002941805
4 2 y 0.05101438 0.1909607 0.2671460 0.796114398
5 3 (Intercept) 3.14391308 0.8417638 3.7349114 0.005747126
6 3 y 0.08418715 0.2453441 0.3431391 0.740336702
Quick solution with lmList (package nlme):
library(nlme)
lmList(x ~ y | group, data=df)
Call:
Model: x ~ y | group
Data: df
Coefficients:
(Intercept) y
1 0.4786373 0.04978624
2 3.5125369 -0.94751894
3 2.7429958 -0.01208329
4 -5.2231576 2.24589181
5 5.6370824 -0.24223131
6 7.1785581 -0.08077726
7 8.2060808 -0.18283134
8 8.9072851 -0.13090764
9 10.1974577 -0.18514527
10 6.0687105 0.37396911
11 9.0682622 0.23469187
12 15.1081915 -0.29234452
13 17.3147636 -0.30306692
14 13.1352411 0.05873189
15 6.4006623 0.57619151
16 25.4454182 -0.59535396
17 22.0231916 -0.30073768
18 27.7317267 -0.54651597
19 10.9689733 0.45280604
20 23.3495704 -0.14488522
Degrees of freedom: 200 total; 160 residual
Residual standard error: 0.9536226
Borrowed the data df from @Gopala's answer.
Consider also a base solution with lapply() (the loop variable is named g so it does not clash with the column x used in the formula):
regressionList <- lapply(unique(df$group),
                         function(g) lm(x ~ y, df[df$group == g, ]))
And only the coefficients:
coeffList <- lapply(unique(df$group),
                    function(g) lm(x ~ y, df[df$group == g, ])$coefficients)
Even a list of summaries:
summaryList <- lapply(unique(df$group),
                      function(g) summary(lm(x ~ y, df[df$group == g, ])))
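The per-group coefficient vectors can then be combined into a single table; a minimal sketch:
coefMat <- do.call(rbind, coeffList)   # one row per group
rownames(coefMat) <- unique(df$group)
coefMat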

Fisher's and Pearson's test for independence

In R I have 2 datasets: group1 and group2.
For group1 I have 10 values of game_id, which is the id of a game, and number, which is the number of times this game has been played in group1.
So if we type
group1
we get this output
game_id number
1 758565
2 235289
...
10 87084
For group2 we get
game_id number
1 79310
2 28564
...
10 9048
If I want to test whether there is a statistical difference between group1 and group2 for the first 2 game_ids, I can use Pearson's chi-square test.
In R I simply create the matrix
# The first 2 'numbers' in group1
a <- c( group1[1,2] , group1[2,2] )
# The first 2 'numbers' in group2
b <- c( group2[1,2], group2[2,2] )
# Creating it on matrix-form
m <- rbind(a,b)
So m gives us
a 758565 235289
b 79310 28564
Here I can test H: "a is independent of b", i.e. whether users in group1 play game_id 1 more than game_id 2, compared to group2.
In R we type chisq.test(m) and we get a very low p-value, meaning that we can reject H, i.e. a and b are not independent.
How should one find the game_ids that are played significantly more in group1 than in group2?
I created a simpler version with only 3 games. I'm using a chi-squared test and a test comparing proportions. Personally, I prefer the second one, as it gives you an idea about what percentages you're comparing. Run the script and make sure you understand the process.
# dataset of group 1
dt_group1 = data.frame(game_id = 1:3,
                       number_games = c(758565, 235289, 87084))
dt_group1
# game_id number_games
# 1 1 758565
# 2 2 235289
# 3 3 87084
# add extra variables
dt_group1$number_rest_games = sum(dt_group1$number_games) - dt_group1$number_games # needed for chisq.test
dt_group1$number_all_games = sum(dt_group1$number_games) # needed for prop.test
dt_group1$Prc = dt_group1$number_games / dt_group1$number_all_games # just to get an idea about the percentages
dt_group1
# game_id number_games number_rest_games number_all_games Prc
# 1 1 758565 322373 1080938 0.70176550
# 2 2 235289 845649 1080938 0.21767113
# 3 3 87084 993854 1080938 0.08056336
# dataset of group 2
dt_group2 = data.frame(game_id = 1:3,
                       number_games = c(79310, 28564, 9048))
# add extra variables
dt_group2$number_rest_games = sum(dt_group2$number_games) - dt_group2$number_games
dt_group2$number_all_games = sum(dt_group2$number_games)
dt_group2$Prc = dt_group2$number_games / dt_group2$number_all_games
# input the game id you want to investigate
input_game_id = 1
# create a table of successes (games played) and failures (games not played)
dt_test = rbind(c(dt_group1$number_games[dt_group1$game_id == input_game_id],
                  dt_group1$number_rest_games[dt_group1$game_id == input_game_id]),
                c(dt_group2$number_games[dt_group2$game_id == input_game_id],
                  dt_group2$number_rest_games[dt_group2$game_id == input_game_id]))
# perform chi sq test
chisq.test(dt_test)
# Pearson's Chi-squared test with Yates' continuity correction
#
# data: dt_test
# X-squared = 275.9, df = 1, p-value < 2.2e-16
# create a vector of successes (games played) and vector of total games
x = c(dt_group1$number_games[dt_group1$game_id == input_game_id],
      dt_group2$number_games[dt_group2$game_id == input_game_id])
y = c(dt_group1$number_all_games[dt_group1$game_id == input_game_id],
      dt_group2$number_all_games[dt_group2$game_id == input_game_id])
# perform test of proportions
prop.test(x,y)
# 2-sample test for equality of proportions with continuity correction
#
# data: x out of y
# X-squared = 275.9, df = 1, p-value < 2.2e-16
# alternative hypothesis: two.sided
# 95 percent confidence interval:
# 0.02063233 0.02626776
# sample estimates:
# prop 1 prop 2
# 0.7017655 0.6783155
The main thing is that chisq.test compares counts/proportions, so you need to provide the number of "successes" and "failures" for the groups you compare (a contingency table as input). prop.test is another counts/proportions test, for which you provide the number of "successes" and the totals.
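To make the two input shapes concrete, a minimal sketch (the counts here are made-up assumptions):
m <- rbind(c(100, 900),   # group 1: successes, failures
           c(80, 920))    # group 2: successes, failures
chisq.test(m)                                  # contingency-table input
prop.test(x = c(100, 80), n = c(1000, 1000))   # successes and totals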
Now that you're happy with the result and you've seen how the process works, I'll add a more efficient way to perform those tests.
The first one is using dplyr and broom packages:
library(dplyr)
library(broom)
# dataset of group 1
dt_group1 = data.frame(game_id = 1:3,
                       number_games = c(758565, 235289, 87084),
                       group_id = 1) ## adding the id of the group
# dataset of group 2
dt_group2 = data.frame(game_id = 1:3,
                       number_games = c(79310, 28564, 9048),
                       group_id = 2) ## adding the id of the group
# combine datasets
dt = rbind(dt_group1, dt_group2)
dt %>%
  group_by(group_id) %>%                        # for each group id
  mutate(number_all_games = sum(number_games),  # create new columns
         number_rest_games = number_all_games - number_games,
         Prc = number_games / number_all_games) %>%
  group_by(game_id) %>%                         # for each game
  do(tidy(prop.test(.$number_games, .$number_all_games))) %>%  # perform the test
  ungroup()
# game_id estimate1 estimate2 statistic p.value parameter conf.low conf.high
# (int) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
# 1 1 0.70176550 0.67831546 275.89973 5.876772e-62 1 0.020632330 0.026267761
# 2 2 0.21767113 0.24429962 435.44091 1.063385e-96 1 -0.029216006 -0.024040964
# 3 3 0.08056336 0.07738492 14.39768 1.479844e-04 1 0.001558471 0.004798407
The other one is using data.table and broom packages:
library(data.table)
library(broom)
# dataset of group 1
dt_group1 = data.frame(game_id = 1:3,
                       number_games = c(758565, 235289, 87084),
                       group_id = 1) ## adding the id of the group
# dataset of group 2
dt_group2 = data.frame(game_id = 1:3,
                       number_games = c(79310, 28564, 9048),
                       group_id = 2) ## adding the id of the group
# combine datasets
dt = data.table(rbind(dt_group1, dt_group2))
# create new columns for each group
dt[, number_all_games := sum(number_games), by=group_id]
dt[, `:=`(number_rest_games = number_all_games - number_games,
          Prc = number_games / number_all_games), by = group_id]
# for each game id compare percentages
dt[, tidy(prop.test(.SD$number_games, .SD$number_all_games)) , by=game_id]
# game_id estimate1 estimate2 statistic p.value parameter conf.low conf.high
# 1: 1 0.70176550 0.67831546 275.89973 5.876772e-62 1 0.020632330 0.026267761
# 2: 2 0.21767113 0.24429962 435.44091 1.063385e-96 1 -0.029216006 -0.024040964
# 3: 3 0.08056336 0.07738492 14.39768 1.479844e-04 1 0.001558471 0.004798407
You can see that each row represents one game and that the comparison is between groups 1 and 2. You can get the p-values from the corresponding column, as well as other information about each test/comparison.
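One caveat: since you run one test per game_id, consider adjusting the p-values for multiple comparisons; a minimal sketch, assuming the tidy output from above is stored in `res`:
res$p.adjusted <- p.adjust(res$p.value, method = "holm")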
