Code for multiple ranges of increasing numbers to determine a representative number in R

I seek guidance on R code to make a probability calculation based on survey data. The survey asked respondents to select from five descriptions of how often they sought advice (the “frequency” variable in the data frame below). The number of respondents choosing each description is the “respondents” variable, and the maximum annual number I have assumed for each frequency is the “max.year.est” variable.
use <- data.frame(frequency = c("rarely; less than once per quarter",
                                "very occasionally; about every other month",
                                "occasionally; about once per month",
                                "fairly frequently; more than once per month",
                                "very frequently; once a week or more"),
                  respondents = c(50, 40, 30, 20, 10),
                  max.year.est = c(3, 6, 12, 18, 78))
The dplyr call below adds three columns to the use data frame, each presenting an annual total of requests for advice by the respondent group, obtained by multiplying the respondents (i) by the maximum number for each range, which requires assumptions about the top of the range for the two most frequent categories; (ii) by the midpoint of the range (mean.requests), a more moderate intermediate assumption; and (iii) by 40% of max.requests (lower.requests), a figure pulled out of a hat because it seems reasonable that lower numbers of requests will be more common in each range than higher numbers.
library(dplyr)

use <- use %>%
  mutate(max.requests   = respondents * max.year.est,  # top of each range
         mean.requests  = 0.5 * max.requests,          # midpoint of each range
         lower.requests = 0.4 * max.requests)          # bottom-weighted guess
If we assume the annual requests for advice in each range are distributed in a reasonable pattern, with more respondents making requests at smaller numbers in a range and fewer as you move up the range, what is a statistical method and R code (Poisson distribution?) to arrive at a defensible number of total annual requests in each range given the assumptions above?
Thank you for your comments and answers.
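One way to make the "lower numbers are more common" assumption explicit is to put a decaying distribution over the counts inside each range and take its expected value, rather than picking a single fixed fraction. Below is a minimal sketch along those lines; the range minima (min.year.est) and the geometric-style decay parameter are invented for illustration, and a truncated Poisson could be dropped into the same structure.

library(dplyr)

# Assumed lower bound of each frequency range (hypothetical values).
min.year.est <- c(1, 4, 7, 13, 19)
decay <- 0.7   # assumed per-step decay: each extra request is 30% less likely

# Expected annual requests for one respondent in a range [lo, hi],
# weighting lower counts more heavily.
expected_in_range <- function(lo, hi, decay) {
  k <- lo:hi
  w <- decay^(k - lo)
  sum(k * w / sum(w))
}

use <- use %>%
  mutate(expected.requests = mapply(expected_in_range, min.year.est,
                                    max.year.est, MoreArgs = list(decay = decay)),
         total.requests = respondents * expected.requests)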


Creating Balance tables across different samples (modelsummary?)

I have a dataset from a survey that looks something like this.
library(dplyr)
library(modelsummary)
library(Hmisc)

set.seed(123)
df <- data.frame(var1 = runif(1000),
                 var2 = runif(1000),
                 var3 = runif(1000),
                 particip = rbinom(1000, size = 1, prob = 0.1)) %>%
  mutate(refusal = ifelse(particip == 1, 0, rbinom(1000, size = 1, prob = 0.06)),
         sampled = ifelse(particip == 1 | refusal == 1, "Sampled", "Not Sampled")) %>%
  arrange(desc(particip))
describe(df[, 4:6])
My aim is to check for covariate balance (var1:var3) across different subsets of the data:
particip marks respondents who were sampled and replied to the survey.
refusal marks respondents who were sampled but did not respond to the survey.
sampled marks all respondents who were sampled, regardless of whether they refused or agreed to take the survey.
I would like to create a single balance table that compares the means of var1:var3 across multiple subsamples. In particular, I would like to compare the means for the whole universe of respondents (all 1000 possible respondents) to the means of the sampled respondents, the means of the respondents who refused to take the survey, and the means of the respondents who eventually took part in the survey.
I have tried the function datasummary_balance from the package modelsummary, but I am only able to compare the means of one group at a time (participated, sampled, or refusals). Instead, I would like to create a single table with all of the means for these three different groups.
datasummary_balance(~ particip, fmt = 3, data = df, output = "markdown")
datasummary_balance(~ refusal,  fmt = 3, data = df, output = "markdown")
datasummary_balance(~ sampled,  fmt = 3, data = df, output = "markdown")
If anyone knows how to do this, it would be a great help.
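One possible route, sketched rather than definitive: because the four groups overlap (everyone belongs to the universe), each comparison column can be built by stacking a labelled copy of the relevant subsample and letting datasummary() from modelsummary nest Mean under each label. The column name subsample is invented here.

library(dplyr)
library(modelsummary)

# Stack one labelled copy of the data per (overlapping) subsample.
stacked <- bind_rows(
  df %>% mutate(subsample = "Universe"),
  df %>% filter(sampled == "Sampled") %>% mutate(subsample = "Sampled"),
  df %>% filter(refusal == 1)  %>% mutate(subsample = "Refused"),
  df %>% filter(particip == 1) %>% mutate(subsample = "Participated")
)
stacked$subsample <- factor(stacked$subsample,
                            levels = c("Universe", "Sampled", "Refused", "Participated"))

# One table: means of var1:var3, one column per subsample.
datasummary(var1 + var2 + var3 ~ subsample * Mean,
            data = stacked, output = "markdown")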

How would I devise code to get both within-subject and between-subject comparisons when carrying out a repeated measures ANOVA?

I understand I can use lmer, but I would like to undertake a repeated measures ANOVA in order to carry out both a within-group and a between-group analysis.
I am trying to compare the difference in metabolite levels between three groups (control, disease 1, and disease 2) over time (measurements collected at two timepoints), and also to make a within-group comparison of timepoint 1 with timepoint 2.
Important to note: these are subjects sending in samples, not timed trial visits where samples would have been taken on the same day or thereabouts. For instance, timepoint 1 for one subject could be 1995 and timepoint 1 for another subject 1996; the gap between timepoint 1 and timepoint 2 is also not consistent, averaging around 5 years with a maximum of 15 and a minimum of 0.5 years.
I have 43, 45, and 42 subjects respectively in each group. My response variable would be, say, metabolite 1, and the predictor would be Group. I also have covariates I would like accounted for, such as age, BMI, and gender, and I need to account for family ID (which I have as a random effect in my lmer model). My Time column has 0 marking timepoint 1 and 1 marking timepoint 2. I understand I must separate the within-subject and between-subject terms; however, I am unsure how to do this. From my understanding so far:
If I am using anova_test, the formula to be specified for between subjects would be:
Metabolite1 ~ Group*Time
Whilst for within subjects (seeing whether there is any difference within each group at TP1 vs TP2), I am unsure how I would specify this (the below is not correct):
Metabolite1 ~ Time + Error(ID/Time)
The question is: how do I combine all of this to specify both the between- and within-subject comparisons I would like, while accounting for the covariates such as gender, age, and BMI? I am assuming that if I specify covariates it will become an ANCOVA rather than an ANOVA?
Here is some example code I found that has both a between- and within-subject comparison design (termed a mixed ANOVA):
aov1 <- aov(Recall ~ (Task*Valence*Gender*Dosage) + Error(Subject/(Task*Valence)) + (Gender*Dosage), ex5)
The author specifies that the within-subject comparison sits inside the Error term. This is also explained here: https://rpkgs.datanovia.com/rstatix/reference/anova_test.html
However, here is mine, which I realise is currently very wrong (it is missing a correct within-subject comparison):
repmes <- anova_test(data = mets, Metabolite1 ~ Group*Time + Error(ID/Time),
                     covariate = c("Age", "BMI", "Gender", "FamilyID"))
I would ultimately like to determine from this, with appropriate post hoc tests (if p < 0.05), whether there are any significant differences in Metabolite 1 levels between groups across the two timepoints (i.e. over time), and whether, within each group, there are significant differences between TP1 and TP2. Can anybody please help?
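A sketch of how this might look using rstatix's dedicated arguments instead of a raw formula; the data frame mets and the columns ID, Group, Time, Age, BMI, and Gender are the names from the question, and this usage is an assumption rather than a checked answer. FamilyID is left out of the covariates because, as a random effect, it arguably does not belong in that list.

library(rstatix)

repmes <- anova_test(
  data      = mets,
  dv        = Metabolite1,        # response
  wid       = ID,                 # subject identifier
  between   = Group,              # control vs disease 1 vs disease 2
  within    = Time,               # timepoint 1 vs timepoint 2
  covariate = c(Age, BMI, Gender) # adjusting for these makes it an ANCOVA
)
get_anova_table(repmes)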

Finding values in one column with a minimum number of records in another column, then plotting the mean increase per time period for those selected

The dataset is based on the growth of ~250 trees over 500 years. The columns are treeID and GrowthAmount, plus the year of each measurement. Because of the age of the data, the trees have varying numbers of measurements taken each year, and often no record for a given year.
I need to find the trees which have at least 350 measurements, then plot the mean growth for each year, for only the selected trees, against the year.
I have tried using select, subset, and many other things, but my experience in R is very limited, so I am looking for some help here.
treedatalongsum <- treedatalong %>%
  group_by(Tree) %>%
  summarize(total = n()) %>%
  arrange(desc(total))
treedatalongsum
tree %>% filter(Tree == 1, 2, 3, 91, 115, 116, 118, 102, 119, 121) %>%
  group_by(Tree) %>% summarize(mean(tree$Growth, na.rm = TRUE))
This was my original attempt, and it did not work. I was trying to select the trees, have them labelled by tree ID, and then show in the next column how much they grew on average.
Thank you for your time and knowledge. I really appreciate it.
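A sketch of the two steps under the stated goal, assuming the long data frame is treedatalong with columns Tree and Growth as in the attempt above, plus a Year column (that name is an assumption):

library(dplyr)
library(ggplot2)

# Keep only trees with at least 350 non-missing measurements.
keepers <- treedatalong %>%
  group_by(Tree) %>%
  filter(sum(!is.na(Growth)) >= 350) %>%
  ungroup()

# Mean growth per year for the selected trees, plotted against the year.
keepers %>%
  group_by(Year) %>%
  summarize(mean.growth = mean(Growth, na.rm = TRUE)) %>%
  ggplot(aes(x = Year, y = mean.growth)) +
  geom_line()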

Find period with lowest variability in a time series in R

I have a time series and would like to find the period with the lowest contiguous variability, i.e. the period in which the rolling SD hovers around the minimum for the longest consecutive run of time steps.
library(zoo)

test <- c(10, 12, 14, 16, 13, 13, 14, 15, 15, 14, 16, 16, 16, 16, 16, 16, 16, 15, 14, 15, 12, 11, 10)
rol <- rollapply(test, width = 4, FUN = sd)  # rolling SD over 4-point windows
rol
I can easily see from the data or the graph that the longest period with the lowest variability starts at t = 11. Is there a function that can help me find this period of continued low variability, perhaps automatically trying different sizes for the rolling window? I am not interested in finding the time step with the lowest SD, but in a period where this low SD is more consistent than elsewhere.
All I can think of for now is looking at the difference rol[i] - rol[i+1], looping through the vector, and using a counter to find runs of consecutive low SD values. I was also thinking of using cluster analysis, something like kmeans(rol, 5), but I can have long, complex time series and I would have to pick the number of clusters manually.
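One sketch along the counter idea above: flag the windows whose rolling SD sits within a tolerance of the minimum, then let rle() find the longest consecutive run of flagged windows. The tolerance tol is an assumption; it controls what counts as "hovering around the minimum".

library(zoo)

rol <- rollapply(test, width = 4, FUN = sd)
tol <- 0.25                        # assumed slack above the minimum SD
low <- rol <= min(rol) + tol       # TRUE where variability is near-minimal

runs <- rle(low)
best <- which.max(runs$lengths * runs$values)      # longest TRUE run
start <- sum(runs$lengths[seq_len(best - 1)]) + 1  # index into rol
c(start = start, length = runs$lengths[best])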

Determining percentile based on reference table

I have standardized normal values for heart rates and respiratory rates in children from a recent article. I copied them into a CSV to use as a dataset in R. The data is simply different age ranges (e.g. 3 months to 6 months, or 1 year to 2 years) and then the heart rate at the 1st, 10th, 25th, 50th, 75th, 90th, and 99th percentiles for that age range.
I want to compare a patient's data with this reference table to tell me what percentile they are at. Since this is a perfectly normal distribution, I don't think it's a very hard task, but it's outside my R experience and I can't seem to find any good information on how to accomplish it.
Based on what you explained, I can suggest this simple function, which takes the heart rate and the age range of your patient and returns the percentile based on a normal density for that specific range.
my.quantile <- function(myrange, heart.rate) {
  # Reference table: one row per age range, with that range's normal parameters.
  ref <- data.frame(range = c("range1", "range2", "range3"),
                    mean  = c(120, 90, 60),
                    sd    = c(12, 15, 30))
  res <- pnorm(q    = heart.rate,
               mean = subset(ref, range == myrange)$mean,
               sd   = subset(ref, range == myrange)$sd)
  return(res * 100)  # probability -> percentile
}
### my.quantile("range1", 140)
### [1] 95.22096
From what you say, if the distribution is perfectly normal you just need the mean and variance of each range, right? You can adapt the function for the respiratory rate.
EDIT: to retrieve the normal distribution parameters from your quantile table, given the hypothesis that the quantiles you have are fairly precise:
i/ Your mean parameter is exactly the 50th percentile.
ii/ You find the standard deviation by taking any other percentile; for instance, let's assume your 90th percentile is 73 beats and your 50th is 61 beats:
(73-61)/qnorm(0.90)
### [1] 9.36365
9.36 is your standard deviation. From here it shouldn't be very hard to automate it.
Note: if your percentile data are not very precise, you may want to repeat the operation for each percentile value and take the average.
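A sketch of that averaging idea: estimate the standard deviation from each non-median percentile and average the estimates. The percentile values below are hypothetical, standing in for one row of the reference table.

p    <- c(0.01, 0.10, 0.25, 0.75, 0.90, 0.99)  # table percentiles, excluding the 50th
vals <- c(40, 49, 55, 67, 73, 83)               # hypothetical heart rates at those percentiles
med  <- 61                                      # 50th percentile, i.e. the mean
sd.est <- mean((vals - med) / qnorm(p))         # one sd estimate per percentile, averaged
sd.est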
