I have a data set composed of 2 subjects and measures 8 times for each subject.
dat <- data.frame(c(1, 1, 2, 2), rep(c("t1", "t2"), 2), c(50, 52, 49, 51))
colnames(dat) <- c("subject", "time", "result")
dat <- dat %>% mutate(subject = as.factor(subject)) %>%
mutate(time = as.factor(time))
and so on for the rest of the 6 times left.
I am trying to apply a repeated-measures ANOVA to see if the effect of time is significant for each subject, but I keep getting DFd is zero, when it is actually 1.
aov <- dat %>% anova_test(dv = result, wid = subject, within = time, type = 2, detailed = TRUE)
get_anova_table(aov, correction = "none")
Can someone please help me?
Related
Trying to add a column of statistical test results in gtsummary::tbl_continuous().
Here is an example:
library(tidyverse)
library(gtsummary)
df <- tibble(
Group = as.factor(c(rep('A', 8), rep('B', 8), rep('C', 8))),
Variable = as.factor(rep(c(rep('x', 4), rep('y', 4)), 3)),
Value = runif(24, 0, 10)
)
df %>%
tbl_continuous(
variable = Value,
by = Variable
) %>%
add_p()
which produces:
I would like to add t-tests (or more generally any appropriate test, ANOVA for instance if there were three columns) to the p-value column for each row to compare (x, y) between groups. I tried implementing the suggestion here without success.
I am trying to forecast student behaviour by year. It isn't working, maybe because my data is too small. I'm using Arima; however, the trend line keeps showing a straight line which I'm not sure is right. Might be this because ARIMA shows ARIMA(0,0,0) with non-zero mean.
Year - General
Students - Numeric
How can I forecast a student's behaviour by year?
Small data set, but this is one way to go about it. Modelling each student separately (with student as a key), and using the tidyverts approach:
library(dplyr)
library(tidyr)
library(tsibble)
library(feasts)
library(fable)
Data set
df <- structure(list(Year = structure(c(1995, 1996, 1997), class = "numeric"),
Student1 = c(3, 1, 3), Student2 = c(2, 2, 2), Student3 = c(2,
3, 3), Student4 = c(2, 3, 2), Student5 = c(3, 3, 4)), row.names = c(NA,
3L), class = "data. Frame")
Tidy data
df <- df |> pivot_longer(names_to = "Student", cols = starts_with("Student")) |>
as_tsibble(index = Year, key = Student)
Visualise
df |> autoplot()
df |>
filter(Student == "Student1") |>
gg_tsdisplay(value, plot_type = 'partial')
Fit ARIMA
Stu_fit <- df |>
model(search = ARIMA(value, stepwise = FALSE))
Check fit
glance(Stu_fit)
Stu_fit$search
Stu_fit |>
filter(Student == "Student1") |>
gg_tsresiduals()
Forecast
Stu_fit |>
forecast(h = 5) |>
filter(.model == 'search') |>
autoplot()
Hope this is helps! :-)
I have a longitudinal dataset with acceptance data of people of different declared majors. At each time point (2021, 2020, etc.), I want to see if the acceptance rate of undeclared individuals differ significantly (in either direction) from declared individuals.
I'm eventually going to pipe these results into a plot that puts an asterik if groups differ significantly, but I wanted to know if there's an efficient way to perform these logisitic regressions so I get a column for each in my dataset that says if the group was significantly different than the undeclared students at the same time point.
To illustrate, here's a test dataset:
library(dplyr)
library(lubridate)
test <- tibble(major = as.factor(c(rep(c("undeclared", "computer science", "english"), 3))),
time = ymd(c(rep("'2021-01-01", 3), rep("'2020-01-01", 3), rep("'2019-01-01", 3))),
admit = c(500, 1000, 450, 800, 300, 100, 1000, 400, 150),
reject = c(1000, 300, 1000, 210, 100, 900, 1500, 350, 1200)) %>%
mutate(total = rowSums(test[ , c("admit", "reject")], na.rm=TRUE),
accept_rate = admit/total)
And here's how I would manually perform each regression (but don't want this)
test$major <- relevel(test$major , ref = "undeclared")
just_2021 <- test %>%
filter(time == '2021-01-01')
m_2021 <- glm(accept_rate ~ major, data = just_2021, weights = total, family = binomial)
summary(m_2021) #english not sig diff from undeclared; CS is sig diff from undeclared
And finally, this is what I'm hoping my dataset looks like:
library(dplyr)
library(lubridate)
answer <- tibble(major = as.factor(c(rep(c("undeclared", "computer science", "english"), 3))),
time = ymd(c(rep("'2021-01-01", 3), rep("'2020-01-01", 3), rep("'2019-01-01", 3))),
admit = c(500, 1000, 450, 800, 300, 100, 1000, 400, 150),
reject = c(1000, 300, 1000, 210, 100, 900, 1500, 350, 1200)) %>%
mutate(total = rowSums(test[ , c("admit", "reject")], na.rm=TRUE),
accept_rate = admit/total) %>%
mutate(dif_than_undeclared_2021 = c(NA_character_, "Yes", "No", rep(NA_character_, 6)),
dif_than_undeclared_2020 = c(rep(NA_character_, 4), "Yes", "Yes", rep(NA_character_, 3)),
dif_than_undeclared_2019 = c(rep(NA_character_, 7), "Yes", "Yes"))
answer
I know that purrr can help with iteration, but I don't know if it applies in this case. Any help would be gladly appreciated!
library(broom)
library(tidyr)
library(dplyr)
test %>%
# create year column
mutate(year = year(time),
major = relevel(major, "undeclared")) %>%
# nest by year
nest(data = -year) %>%
# compute regression
mutate(reg = map(data, ~glm(accept_rate ~ major, data = .,
family = binomial, weights = total)),
# use broom::tidy to make a tibble out of model object
reg_tidy = map(reg, tidy)) %>%
# get data and regression results back to tibble form
unnest(c(data, reg_tidy)) %>%
filter(term != "(Intercept)") %>%
# create the significant yes/no column
mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
# remove the unnecessary columns
select(-c(term, estimate, std.error, statistic, p.value, reg))
How to easily generate/simulate meaningful example data for modelling: e.g. telling that give me n rows of data, for 2 groups, their sex distributions and mean age should differ by X and Y units, respectively? Is there a simple way for doing it automatically? Any packages?
For example, what would be the simplest way for generating such data?
groups: two groups: A, B
sex: different sex distributions: A 30%, B 70%
age: different mean ages: A 50, B 70
PS! Tidyverse solutions are especially welcome.
My best try so far is still quite a lot of code:
n=100
d = bind_rows(
#group A females
tibble(group = rep("A"),
sex = rep("Female"),
age = rnorm(n*0.4, 50, 4)),
#group B females
tibble(group = rep("B"),
sex = rep("Female"),
age = rnorm(n*0.3, 45, 4)),
#group A males
tibble(group = rep("A"),
sex = rep("Male"),
age = rnorm(n*0.20, 60, 6)),
#group B males
tibble(group = rep("B"),
sex = rep("Male"),
age = rnorm(n*0.10, 55, 4)))
d %>% group_by(group, sex) %>%
summarise(n = n(),
mean_age = mean(age))
There are lots of ways to sample from vectors and to draw from random distributions in R. For example, the data set you requested could be created like this:
set.seed(69) # Makes samples reproducible
df <- data.frame(groups = rep(c("A", "B"), each = 100),
sex = c(sample(c("M", "F"), 100, TRUE, prob = c(0.3, 0.7)),
sample(c("M", "F"), 100, TRUE, prob = c(0.5, 0.5))),
age = c(runif(100, 25, 75), runif(100, 50, 90)))
And we can use the tidyverse to show it does what was expected:
library(dplyr)
df %>%
group_by(groups) %>%
summarise(age = mean(age),
percent_male = length(which(sex == "M")))
#> # A tibble: 2 x 3
#> groups age percent_male
#> <chr> <dbl> <int>
#> 1 A 49.4 29
#> 2 B 71.0 50
I create some models like this using a nested tidyr dataframe:
set.seed(1)
library(tidyr)
library(dplyr)
library(sjPlot)
library(tibble)
library(purrr)
fits <- tribble(~group, ~colA, ~colB, ~colC,
sample(c("group1", "group2"), 10, replace = T), 0, sample(10, replace = T), sample(10, replace = T),
sample(c("group1", "group2"), 10, replace = T), 1, sample(10, replace = T), sample(10, replace = T)) %>%
unnest(cols = c(colB, colC)) %>%
nest(data=-group) %>%
mutate(fit= map(data, ~glm(formula = colA ~ colB + colC, data = .x, family="binomial"))) %>%
dplyr::select(group, fit) %>%
tibble::column_to_rownames("group")
I would like to use this data to create some quick marginal effects plots with sjPlot::plot_models like this
plot_models(as.list(fits), type = "pred", terms = c("colB", "colA", "colC"))
Unfortunately, I get the error
Error in if (fam.info$is_linear) tf <- NULL else tf <- "exp" :
argument is of length zero
In addition: Warning message:
Could not access model information.
I've played around a bit with the nesting of the data but I've been unable to get it into a format that sjPlot::plot_models will accept.
What I was expecting to get is a "Forest plot of multiple regression models" as described in the help file. Ultimately, the goal is to plot the marginal effects of regression models by group, which I was hoping the plot_models will do (please correct me if I'm wrong).
It think there are some issues with the original code as well as with the data. There are arguments from plot_model in the function call which are not supported in plot_models. I first show an example that shows how plot_models can be called and used with a nested tibble using {ggplot2}'s diamonds data set. Then I apply this approach to the OP's sample data, which doesn't yield useable results*. Finally, I create some new toy data to show how the approach could be applied to a binominal model.
(* In the original toy data the dependent variable is either always 0 or always 1 in each model so this is unlikely to yield useable results).
set.seed(1)
library(tidyr)
library(dplyr)
library(sjPlot)
library(tibble)
library(ggplot2)
# general example
fits <- tibble(id = c("x", "y", "z")) %>%
rowwise() %>%
mutate(fit = list(glm(reformulate(
termlabels = c("cut", "color", "depth", "table", "price", id),
response = "carat"),
data = diamonds)))
plot_models(fits$fit)
# OP's example data
fits2 <- tribble(~group, ~colA, ~colB, ~colC,
sample(c("group1", "group2"), 10, replace = T), 0,
sample(10, replace = T), sample(10, replace = T),
sample(c("group1", "group2"), 10, replace = T), 1,
sample(10, replace = T),
sample(10, replace = T)) %>%
unnest(cols = c(colB, colC)) %>%
nest(data = -group) %>%
rowwise() %>%
mutate(fit = list(glm(formula = colA ~ colB + colC, data = data, family="binomial")))
plot_models(fits2$fit)
#> Warning: Transformation introduced infinite values in continuous y-axis
#> Warning: Removed 4 rows containing missing values (geom_point).
# new data for binominal model
n <- 500
g <- round(runif(n, 0L, 1L), 0)
x1 <- runif(n,0,100)
x2 <- runif(n,0,100)
y <- (x2 - x1 + rnorm(n,sd=20)) < 0
fits3 <- tibble(g, y, x1, x2) %>%
nest_by(g) %>%
mutate(fit = list(glm(formula = y ~ x1 + x2, data = data, family="binomial")))
plot_models(fits3$fit)
Created on 2021-01-23 by the reprex package (v0.3.0)