I'm looking to generate a dataset in R for a repeated measures model and I'm not sure where to start.
The outcome of interest is continuous between 0-100. This is for a two arm trial (say groups "a" and "b"), with 309 participants in each arm. Each participant is assessed at baseline, then fortnightly for one year (27 total assessments). There will be loss to followup and withdrawals over the year (~30% after one year), and participants may miss individual assessments at random.
For now, I am assuming the standard deviation is the same at each timepoint, and for both arms (11). The mean will change over time. I'm working on the assumption each participant's score is correlated with their baseline measurement.
How can I generate this dataset? I'm intending to compare repeated measures regression methods.
I think the following fulfils your requirements. It works by taking the cumulative sum of samples from a normal distribution over 27 weeks and converting these into a logistic scale between 0 and 100 (so that the maximum / minimum scores are never breached). It uses replicate to do this for 309 participants. It then simulates 30% drop outs by choosing random participants and a random week, following which their measurements are all NA. It also adds in some random missing weeks for the rest of the participants. The result is pivoted into long format to allow for easier analysis.
library(tidyverse)
set.seed(1)
# Generate correlated scores for 309 people over 27 visits
df <- setNames(cbind(data.frame(ID = 1:309, t(replicate(309, {
x <- cumsum(rnorm(27, 0.05, 0.1))
round(100 * exp(x)/(1 + exp(x)))
})))), c('ID', paste0('Visit_', 1:27)))
# Model dropouts at 30% rate
dropout <- sample(c(TRUE, FALSE), 309, TRUE, prob = c(0.7, 0.3))
df[cbind(which(!dropout), sample(2:28, sum(!dropout), TRUE))] <- NA
df <- as.data.frame(t(apply(df, 1, function(x) ifelse(is.na(cumsum(x)), NA,x))))
# Add random missing visits
df[cbind(sample(309, 100, TRUE), sample(2:28, 100, TRUE))] <- NA
df <- pivot_longer(df, -ID, names_to = 'Week', values_to = 'Score') %>%
mutate(Week = 2 * (as.numeric(gsub('\\D+', '', Week)) - 1))
Our data frame now looks like this:
head(df)
#> # A tibble: 6 x 3
#> ID Week Score
#> <dbl> <dbl> <dbl>
#> 1 1 0 50
#> 2 1 2 51
#> 3 1 4 51
#> 4 1 6 56
#> 5 1 8 58
#> 6 1 10 57
And we can see the scores drift upward over time (since we set a small positive mu on our rnorm when creating the scores.
lm(Score ~ Week, data = df)
#>
#> Call:
#> lm(formula = Score ~ Week, data = df)
#>
#> Coefficients:
#> (Intercept) Week
#> 52.2392 0.5102
We can plot and see the overall shape of the scores and their spread:
ggplot(df, aes(Week, Score, group = ID)) + geom_line(alpha = 0.1)
Created on 2023-01-31 with reprex v2.0.2
Related
I'm a little new to R and was hoping to get some insight about how to calculate for any percentile, for example 5th, 15th , etc...
The data I'm working with has two columns:
salary: (datatype is numeric / double)
student (set up as factor / integer, but only has yes/no)
I've already used:
favstats(salary~student, data=Default, na.rm=TRUE)
to get the two rows of stats broken down by whether they're a student or not; however, I'm not sure how to have the output show me a percentile of my choosing.
Would love to know the simplest way to go on about this in R Studio.
Thank you!
The quantile() function in base r does this.
x <- rnorm(100)
percentiles <- c(0.05, 0.15)
quantile(x, percentiles)
#> 5% 15%
#> -1.593506 -1.120130
If you need to produce a more complex summary table, you can do something with {tidyverse} like this:
library(tidyverse)
n <- 50
d <- tibble(student = rep(c(T, F), each = n),
salary = c(rnorm(n, 75, 10), rnorm(n, 95, 15)))
d %>%
group_by(student) %>%
summarize(quantile_salary = quantile(salary, percentiles))
#> # A tibble: 4 × 2
#> # Groups: student [2]
#> student quantile_salary
#> <lgl> <dbl>
#> 1 FALSE 70.0
#> 2 FALSE 78.5
#> 3 TRUE 57.3
#> 4 TRUE 64.0
Created on 2022-09-22 by the reprex package (v2.0.1)
I am having trouble trying to apply a custom function to multiple groups within a data frame and mutate it to the original data. I am trying to calculate the percent inhibition for each row of data (each observation in the experiment has a value). The challenging issue is that the function needs the mean of two different groups of values (positive and negative controls) and then uses that mean value in each calculation.
In other words, the mean of the negative control is subtracted by the experimental value, then divided by the mean of the negative control minus the positive control.
Each observation including the + and - controls should have a calculated percent inhibition, and as a double check, for each experiment(grouping) the
mean of the pct inhib of the - controls should be around 0 and the + controls around 100.
The function:
percent_inhibition <- function(uninhibited, inhibited, unknown){
uninhibited <- as.vector(uninhibited)
inhibited <- as.vector(inhibited)
unknown <- as.vector(unknown)
mu_u <- mean(uninhibited, na.rm = TRUE)
mu_i <- mean(inhibited, na.rm = TRUE)
percent_inhibition <- (mu_u - unknown)/(mu_u - mu_i)*100
return(percent_inhibition)
}
I have a data frame with multiple variables: target, box, replicate, and sample type. I am able to do the calculation by subsetting the data (below), (1 target, box, and replicate) but have not been able to figure out the right way to apply it to all of the data.
subset <- data %>%
filter(target == "A", box == "1", replicate == 1)
uninhib <-
subset$value[subset$sample == "unihib"]
inhib <-
subset$value[subset$sample == "inhib"]
pct <- subset %>%
mutate(pct = percent_inhibition(uninhib, inhib, .$value))
I have tried group_by and do, and nest functions, but my knowledge is lacking in how to apply these functions to my subsetting problem. I'm stuck when it comes to the subset of the subset (calculating the means) and then applying that to the individual values. I am hoping there is an elegant way to do this without all of the subsetting, but I am at a loss on how.
I have tried:
inhibition <- data %>%
group_by(target, box, replicate) %>%
mutate(pct = (percent_inhibition(.$value[.$sample == "uninhib"], .$value[.$sample == "inhib"], .$value)))
But get the error that columns are not the right length, because of the group_by function.
library(tidyr)
library(purrr)
library(dplyr)
data %>%
group_by(target, box, replicate) %>%
mutate(pct = {
x <- split(value, sample)
percent_inhibition(x$uninhib, x$inhib, value)
})
#> # A tibble: 10,000 x 6
#> # Groups: target, box, replicate [27]
#> target box replicate sample value pct
#> <chr> <chr> <int> <chr> <dbl> <dbl>
#> 1 A 1 3 inhib -0.836 1941.
#> 2 C 1 1 uninhib -0.221 -281.
#> 3 B 3 2 inhib -2.10 1547.
#> 4 C 1 1 uninhib -1.67 -3081.
#> 5 C 1 3 inhib -1.10 -1017.
#> 6 A 2 1 inhib -1.67 906.
#> 7 B 3 1 uninhib -0.0495 -57.3
#> 8 C 3 2 inhib 1.56 5469.
#> 9 B 3 2 uninhib -0.405 321.
#> 10 B 1 2 inhib 0.786 -3471.
#> # … with 9,990 more rows
Created on 2019-03-25 by the reprex package (v0.2.1)
Or:
data %>%
group_by(target, box, replicate) %>%
mutate(pct = percent_inhibition(value[sample == "uninhib"],
value[sample == "inhib"], value))
With data as:
n <- 10000L
set.seed(123) ; data <-
tibble(
target = sample(LETTERS[1:3], n, replace = TRUE),
box = sample(as.character(1:3), n, replace = TRUE),
replicate = sample(1:3, n, replace = TRUE),
sample = sample(c("inhib", "uninhib"), n, replace = TRUE),
value = rnorm(n)
)
I have data giving me the percentage of people in some groups who have various levels of educational attainment:
df <- data_frame(group = c("A", "B"),
no.highschool = c(20, 10),
high.school = c(70,40),
college = c(10, 40),
graduate = c(0,10))
df
# A tibble: 2 x 5
group no.highschool high.school college graduate
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 20. 70. 10. 0.
2 B 10. 40. 40. 10.
E.g., in group A 70% of people have a high school education.
I want to generate 4 variables that give me the proportion of people in each group with less than each of the 4 levels of education (e.g., lessthan_no.highschool, lessthan_high.school, etc.).
desired df would be:
desired.df <- data.frame(group = c("A", "B"),
no.highschool = c(20, 10),
high.school = c(70,40),
college = c(10, 40),
graduate = c(0,10),
lessthan_no.highschool = c(0,0),
lessthan_high.school = c(20, 10),
lessthan_college = c(90, 50),
lessthan_graduate = c(100, 90))
In my actual data I have many groups and a lot more levels of education. Of course I could do this one variable at a time, but how could I do this programatically (and elegantly) using tidyverse tools?
I would start by doing something like a mutate_at() inside of a map(), but where I get tripped up is that the list of variables being summed is different for each of the new variables. You could pass in the list of new variables and their corresponding variables to be summed as two lists to a pmap(), but it's not obvious how to generate that second list concisely. Wondering if there's some kind of nesting solution...
Here is a base R solution. Though the question asks for a tidyverse one, considering the dialog in the comments to the question I have decided to post it.
It uses apply and cumsum to do the hard work. Then there are some cosmetic concerns before cbinding into the final result.
tmp <- apply(df[-1], 1, function(x){
s <- cumsum(x)
100*c(0, s[-length(s)])/sum(x)
})
rownames(tmp) <- paste("lessthan", names(df)[-1], sep = "_")
desired.df <- cbind(df, t(tmp))
desired.df
# group no.highschool high.school college graduate lessthan_no.highschool
#1 A 20 70 10 0 0
#2 B 10 40 40 10 0
# lessthan_high.school lessthan_college lessthan_graduate
#1 20 90 100
#2 10 50 90
how could I do this programatically (and elegantly) using tidyverse tools?
Definitely the first step is to tidy your data. Encoding information (like edu level) in column names is not tidy. When you convert education to a factor, make sure the levels are in the correct order - I used the order in which they appeared in the original data column names.
library(tidyr)
tidy_result = df %>% gather(key = "education", value = "n", -group) %>%
mutate(education = factor(education, levels = names(df)[-1])) %>%
group_by(group) %>%
mutate(lessthan_x = lag(cumsum(n), default = 0) / sum(n) * 100) %>%
arrange(group, education)
tidy_result
# # A tibble: 8 x 4
# # Groups: group [2]
# group education n lessthan_x
# <chr> <fct> <dbl> <dbl>
# 1 A no.highschool 20 0
# 2 A high.school 70 20
# 3 A college 10 90
# 4 A graduate 0 100
# 5 B no.highschool 10 0
# 6 B high.school 40 10
# 7 B college 40 50
# 8 B graduate 10 90
This gives us a nice, tidy result. If you want to spread/cast this data into your un-tidy desired.df format, I would recommend using data.table::dcast, as (to my knowledge) the tidyverse does not offer a nice way to spread multiple columns. See Spreading multiple columns with tidyr or How can I spread repeated measures of multiple variables into wide format? for the data.table solution or an inelegant tidyr/dplyr version. Before spreading, you could create a key less_than_x_key = paste("lessthan", education, sep = "_").
Alright, I'm waving my white flag.
I'm trying to compute a loess regression on my dataset.
I want loess to compute a different set of points that plots as a smooth line for each group.
The problem is that the loess calculation is escaping the dplyr::group_by function, so the loess regression is calculated on the whole dataset.
Internet searching leads me to believe this is because dplyr::group_by wasn't meant to work this way.
I just can't figure out how to make this work on a per-group basis.
Here are some examples of my failed attempts.
test2 <- test %>%
group_by(CpG) %>%
dplyr::arrange(AVGMOrder) %>%
do(broom::tidy(predict(loess(Meth ~ AVGMOrder, span = .85, data=.))))
> test2
# A tibble: 136 x 2
# Groups: CpG [4]
CpG x
<chr> <dbl>
1 cg01003813 0.781
2 cg01003813 0.793
3 cg01003813 0.805
4 cg01003813 0.816
5 cg01003813 0.829
6 cg01003813 0.841
7 cg01003813 0.854
8 cg01003813 0.866
9 cg01003813 0.878
10 cg01003813 0.893
This one works, but I can't figure out how to apply the result to a column in my original dataframe. The result I want is column x. If I apply x as a column in a separate line, I run into issues because I called dplyr::arrange earlier.
test2 <- test %>%
group_by(CpG) %>%
dplyr::arrange(AVGMOrder) %>%
dplyr::do({
predict(loess(Meth ~ AVGMOrder, span = .85, data=.))
})
This one simply fails with the following error.
"Error: Results 1, 2, 3, 4 must be data frames, not numeric"
Also it still isn't applied as a new column with dplyr::mutate
fems <- fems %>%
group_by(CpG) %>%
dplyr::arrange(AVGMOrder) %>%
dplyr::mutate(Loess = predict(loess(Meth ~ AVGMOrder, span = .5, data=.)))
This was my fist attempt and mostly resembles what I want to do. Problem is that this one performs the loess prediction on the entire dataframe and not on each CpG group.
I am really stuck here. I read online that the purr package might help, but I'm having trouble figuring it out.
data looks like this:
> head(test)
X geneID CpG CellLine Meth AVGMOrder neworder Group SmoothMeth
1 40 XG cg25296477 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.81107210 1 1 5 0.7808767
2 94 XG cg01003813 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.97052120 1 1 5 0.7927130
3 148 XG cg13176022 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.06900448 1 1 5 0.8045080
4 202 XG cg26484667 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.84077890 1 1 5 0.8163997
5 27 XG cg25296477 iPS__HDF51IPS6_passage33_Female____157.647.1.2 0.81623880 2 2 3 0.8285259
6 81 XG cg01003813 iPS__HDF51IPS6_passage33_Female____157.647.1.2 0.95569240 2 2 3 0.8409501
unique(test$CpG)
[1] "cg25296477" "cg01003813" "cg13176022" "cg26484667"
So, to be clear, I want to do a loess regression on each unique CpG in my dataframe, apply the resulting "regressed y axis values" to a column matching the original y axis values (Meth).
My actual dataset has a few thousand of those CpG's, not just the four.
https://docs.google.com/spreadsheets/d/1-Wluc9NDFSnOeTwgBw4n0pdPuSlMSTfUVM0GJTiEn_Y/edit?usp=sharing
This is a neat Tidyverse way to make it work:
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
models <- fems %>%
tidyr::nest(-CpG) %>%
dplyr::mutate(
# Perform loess calculation on each CpG group
m = purrr::map(data, loess,
formula = Meth ~ AVGMOrder, span = .5),
# Retrieve the fitted values from each model
fitted = purrr::map(m, `[[`, "fitted")
)
# Apply fitted y's as a new column
results <- models %>%
dplyr::select(-m) %>%
tidyr::unnest()
# Plot with loess line for each group
ggplot(results, aes(x = AVGMOrder, y = Meth, group = CpG, colour = CpG)) +
geom_point() +
geom_line(aes(y = fitted))
You may have already figured this out -- but if not, here's some help.
Basically, you need to feed the predict function a data.frame (a vector may work too but I didn't try it) of the values you want to predict at.
So for your case:
fems <- fems %>%
group_by(CpG) %>%
arrange(CpG, AVGMOrder) %>%
mutate(Loess = predict(loess(Meth ~ AVGMOrder, span = .5, data=.),
data.frame(AVGMOrder = seq(min(AVGMOrder), max(AVGMOrder), 1))))
Note, loess requires a minimum number of observations to run (~4? I can't remember precisely). Also, this will take a while to run so test with a slice of your data to make sure it's working properly.
Unfortunately, the approaches described above did not work in my case. Thus, I implemented the Loess prediction into a regular function, which worked very well. In the example below, the data is contained in the df data frame while we group by df$profile and want to fit the Loess prediction into the df$daily_sum values.
# Define important variables
span_60 <- 60/365 # 60 days of a year
span_365 <- 365/365 # a whole year
# Group and order the data set
df <- as.data.frame(
df %>%
group_by(profile) %>%
arrange(profile, day) %>%
)
)
# Define the Loess function. x is the data frame that has to be passed
predict_loess <- function(x) {
# Declare that the loess column exists, but is blank
df$loess_60 <- NA
df$loess_365 <- NA
# Identify all unique profilee IDs
all_ids <- unique(x$profile)
# Iterate through the unique profilee IDs, determine the length of each vector (which should correspond to 365 days)
# and isolate the according rows that belong to the profilee ID.
for (i in all_ids) {
len_entries <- length(which(x$profile == i))
queried_rows <- result <- x[which(x$profile == i), ]
# Run the loess fit and write the result to the according column
fit_60 <- predict(loess(daily_sum ~ seq(1, len_entries), data=queried_rows, span = span_60))
fit_365 <- predict(loess(daily_sum ~ seq(1, len_entries), data=queried_rows, span = span_365))
x[which(x$profile == i), "loess_60"] <- fit_60
x[which(x$profile == i), "loess_365"] <- fit_365
}
# Return the initial data frame
return(x)
}
# Run the Loess prediction and put the results into two columns - one for a short and one for a long time span
df <- predict_loess(df)
This question already has answers here:
Linear Regression and group by in R
(10 answers)
Closed 6 years ago.
I have imported a CSV with 3 columns , 2 columns for Y and X and the third column which identifies the category for X ( I have 20 groups/categories). I am able to run a regression at overall level but I want to run regression for the 20 categories separately and store the co-efs.
I tried the following :
list2env(split(sample, sample$CATEGORY_DESC), envir = .GlobalEnv)
Now I have 20 files, how do I run a regression on these 20 files and store the co-effs somewhere.
Since no data was provided, I am generating some sample data to show how you can run multiple regressions and store output using dplyr and broom packages.
In the following, there are 20 groups and different x/y values per group. 20 regressions are run and output of these regressions is provided as a data frame:
library(dplyr)
library(broom)
df <- data.frame(group = rep(1:20, 10),
x = rep(1:20, 10) + rnorm(200),
y = rep(1:20, 10) + rnorm(200))
df %>% group_by(group) %>% do(tidy(lm(x ~ y, data = .)))
Sample output:
Source: local data frame [40 x 6]
Groups: group [20]
group term estimate std.error statistic p.value
<int> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 (Intercept) 0.42679228 1.0110422 0.4221310 0.684045203
2 1 y 0.45625124 0.7913256 0.5765657 0.580089051
3 2 (Intercept) 1.99367392 0.4731639 4.2134955 0.002941805
4 2 y 0.05101438 0.1909607 0.2671460 0.796114398
5 3 (Intercept) 3.14391308 0.8417638 3.7349114 0.005747126
6 3 y 0.08418715 0.2453441 0.3431391 0.740336702
Quick solution with lmList (package nlme):
library(nlme)
lmList(x ~ y | group, data=df)
Call:
Model: x ~ y | group
Data: df
Coefficients:
(Intercept) y
1 0.4786373 0.04978624
2 3.5125369 -0.94751894
3 2.7429958 -0.01208329
4 -5.2231576 2.24589181
5 5.6370824 -0.24223131
6 7.1785581 -0.08077726
7 8.2060808 -0.18283134
8 8.9072851 -0.13090764
9 10.1974577 -0.18514527
10 6.0687105 0.37396911
11 9.0682622 0.23469187
12 15.1081915 -0.29234452
13 17.3147636 -0.30306692
14 13.1352411 0.05873189
15 6.4006623 0.57619151
16 25.4454182 -0.59535396
17 22.0231916 -0.30073768
18 27.7317267 -0.54651597
19 10.9689733 0.45280604
20 23.3495704 -0.14488522
Degrees of freedom: 200 total; 160 residual
Residual standard error: 0.9536226
Borrowed the data df from #Gopala answer.
Consider also a base solution with lapply():
regressionList <- lapply(unique(df$group),
function(x) lm(x ~ y, df[df$group==x,]))
And only the coefficients:
coeffList <- lapply(unique(df$group),
function(x) lm(x ~ y, df[df$group==x,])$coefficients)
Even list of summaries:
summaryList <- lapply(unique(df$group),
function(x) summary(lm(x ~ y, df[df$group==x,])))