GEE model appropriate - sample size calculation - r

I'm interested in the sample size calculation (preferably in R) for a study with the following properties:
Patients - currently under cancer-specific therapy - should be asked weekly (up to one year [52 times]; alternatively up to 6 months [26 times]) about infection-related symptoms (e.g. fever, pain, ...). These symptoms are assessed with a 15-item self-report questionnaire (4 levels).
Patients will come to the clinic regularly or when an acute situation arises. Here, objective parameters (e.g. blood samples) are collected, with the help of which a statement can be made as to whether an infection is present/existed or not (variable infection: yes/no).
It is expected that, on average, patients report about 3-5 infections during cancer-specific therapy within one year.
Example scenario:
Patient X answers the 15 item questionnaire each week: In week 5 he/she reports having fever, which might be an indicator for an ongoing infection. In week 6 the patient comes to a regular visit to the hospital. The collected blood sample verifies that the patient had an infection the week before.
In week 14 the patient reports some chest pain in the questionnaire; he/she visits the physician to check whether there is an infection. The blood samples do not confirm an infection.
Statistical analyses:
Given these scenarios, I am interested in how much information each of the 15 items provides for correctly predicting whether an infection is about to occur or is already ongoing.
I think a Generalized Estimating Equations (GEE) model might be appropriate in this scenario (compared to ROC/AUC analyses), since I have to account for multiple measurements within each patient and for the fact that some infections can last longer than one week.
This means, for example, that if patient Y has an infection lasting from week 5 to week 6, it should not be counted as two separate infections.
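To make the "one infection, several weeks" point concrete, here is a minimal sketch in base R of how consecutive infected weeks could be collapsed into episodes before modelling. The vector `weekly` is made-up illustration data for a single patient, not study data, and the names `n_episodes` and `episode_id` are hypothetical.

```r
# Hypothetical weekly infection status for one patient (1 = infection confirmed)
weekly <- c(0, 0, 0, 0, 1, 1, 0, 0, 1, 0)

# Runs of consecutive 1s are treated as a single infection episode
runs <- rle(weekly)
n_episodes <- sum(runs$values == 1)

# Label each infected week with the number of the episode it belongs to
starts <- weekly == 1 & c(0, head(weekly, -1)) == 0
episode_id <- ifelse(weekly == 1, cumsum(starts), NA)

print(n_episodes)
print(episode_id)
```

With such an episode label, one could model episode onset (first week of each episode) instead of weekly status, so that an infection spanning weeks 5 and 6 contributes one event.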
Questions
Is a GEE model appropriate to answer whether one or more of the 15 questionnaire items predicts that an infection is about to occur or is already ongoing?
Is it possible to differentiate between the predictive value of these items?
How can I conduct a sample size calculation for the scenario described above (preferably in R)?
Thank you very much!
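One possible way to approach the sample size question is simulation: generate data under assumed effect sizes, fit the model, and count how often the item effect is detected. The sketch below uses only base R. It fits a logistic regression and computes a cluster-robust (sandwich) variance by hand, which corresponds to a GEE with an independence working correlation; it does not use a GEE package. Every numeric value is an assumption chosen for illustration: a weekly infection probability of about 0.08 (roughly 4 infections in 52 weeks), a symptom prevalence of 0.2, an odds ratio of 2, and a patient-level random intercept with SD 0.5. The function name `simulate_power` is hypothetical, and only a single binary item is simulated rather than all 15.

```r
# Power by simulation for one binary symptom item (all defaults are assumptions)
simulate_power <- function(n_patients, n_weeks = 26, beta = log(2),
                           base_logit = qlogis(0.08), sd_patient = 0.5,
                           p_symptom = 0.2, n_sim = 200, alpha = 0.05) {
  reject <- logical(n_sim)
  for (s in seq_len(n_sim)) {
    id <- rep(seq_len(n_patients), each = n_weeks)
    b  <- rnorm(n_patients, 0, sd_patient)[id]      # patient-level heterogeneity
    x  <- rbinom(length(id), 1, p_symptom)          # symptom reported this week
    y  <- rbinom(length(id), 1, plogis(base_logit + b + beta * x))

    fit <- glm(y ~ x, family = binomial)

    # Cluster-robust sandwich variance: bread %*% meat %*% bread,
    # with score contributions summed within each patient
    X      <- model.matrix(fit)
    scores <- rowsum(X * (y - fitted(fit)), id)
    bread  <- vcov(fit)
    V      <- bread %*% crossprod(scores) %*% bread

    z <- coef(fit)["x"] / sqrt(V["x", "x"])
    reject[s] <- abs(z) > qnorm(1 - alpha / 2)
  }
  mean(reject)  # share of simulations in which the item effect was detected
}

set.seed(2024)
sizes <- c(20, 40, 60)
power <- sapply(sizes, function(n) simulate_power(n, n_sim = 100))
print(data.frame(n_patients = sizes, power = power))
```

The data are generated with a patient-specific intercept, while the fitted model estimates a population-averaged effect, so the estimated log odds ratio will be somewhat smaller than `beta`. Testing 15 items would also call for a multiplicity adjustment of `alpha`.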

Related

How to set up a time series for this project (r)?

I am a cross country runner on a high school team, and I am using my limited knowledge of R and linear algebra to create a ranking index for xc teams.
I get my data from milesplit.com, but I am unsure if I am formatting this data properly. So far I created matrices for each race, with odd columns including runner score and even columns including time, where each team has a team_score and team_time column. I want to analyze growth of teams in a time series, but I have two questions about this:
(1): can I combine all of these "race matrices" into a time series? Can I assign all the data in a race matrix a certain date, then make one big time series including all 25 race matrices I made?
(2): Am I closing myself off to insights by not including name and grade for each runner (as I only record time and score)? If so, how can I write a matrix that contains all this information?
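As a rough sketch of both points, one option is to store each race in "long" form (one row per runner) instead of a wide matrix, attach the race date, and stack all races into one data frame. The runner names, grades, times, and dates below are invented for illustration, and the column names (`team`, `runner`, `grade`, `time_sec`, `place`) are assumptions.

```r
# One data frame per race, one row per runner (made-up values)
race1 <- data.frame(team = c("A", "A", "B"),
                    runner = c("Ann", "Bo", "Cy"),
                    grade = c(10, 11, 12),
                    time_sec = c(1105, 1130, 1098),
                    place = c(2, 3, 1))
race2 <- data.frame(team = c("A", "B", "B"),
                    runner = c("Ann", "Cy", "Di"),
                    grade = c(10, 12, 9),
                    time_sec = c(1090, 1101, 1150),
                    place = c(1, 2, 3))

# Name each list element by its race date, then stack with the date attached
races <- list("2023-09-02" = race1, "2023-09-16" = race2)
stacked <- Map(function(df, d) {
  df$race_date <- as.Date(d)
  df
}, races, names(races))
all_races <- do.call(rbind, unname(stacked))

# Team-level trend: mean time per team and race date
team_trend <- aggregate(time_sec ~ team + race_date, data = all_races, FUN = mean)
print(team_trend)
```

Keeping name and grade in the long table costs nothing and allows per-runner or per-grade summaries later; the same pattern should extend to all 25 races.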

Diff-in-Diff estimation in R where all IDs are treated but at different points in time

I am trying to run a diff-in-diff on a dataset at the person-day level, where all individuals in the dataset are treated, albeit at different points in time. There are 5 treatment dates, so, for instance, person X receives the treatment on day 1, person Y receives the treatment on day 10, person Z on day 5, and so forth. What's important here is that every person is treated eventually. Here's a stylized visual representation of the data (where LHS is the dependent variable):
Now, what I am trying to do is run a diff-in-diff where I compare person Z, who was treated on day 5, with person Y, who was not yet treated on day 5 (so, in this setup, person Y would serve as the control group). This criterion would have to be extended to all individuals in the sample so as to run a diff-in-diff simultaneously for all people.
I am not sure how to code this up in R. I am pretty familiar with the feols function in R as I have used it several times in the past to run conventional diff-in-diffs such as the one illustrated here: https://lost-stats.github.io/Model_Estimation/Research_Design/event_study.html. However, in this particular case, I am not sure what I should be interacting Days_To_Treatment with since if I interact with Treatment every observation prior to Days_To_Treatment = 0 will be dropped.
I am honestly pretty clueless as to how to approach this at the moment. Any help, advice, or tip would be greatly appreciated.
Thanks!
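One way to express the "not-yet-treated as control" idea is a stacked comparison: for each treatment date, keep the people treated on that date plus everyone treated later, restrict the window to the days before the next treatment date, and then pool the stacks. The sketch below does this in base R with `lm` on simulated data rather than with `feols`. The panel, the five treatment dates, and the true effect of 2 are assumptions for illustration, and names such as `did`, `unit_stack`, and `day_stack` are hypothetical.

```r
set.seed(1)
treat_dates <- c(4, 8, 12, 16, 20)           # five assumed treatment dates
panel <- expand.grid(id = 1:50, day = 1:20)
unit_treat <- rep(treat_dates, each = 10)    # treatment day of each person
panel$treat_day <- unit_treat[panel$id]
panel$treated <- as.integer(panel$day >= panel$treat_day)
unit_fe <- rnorm(50)
panel$lhs <- unit_fe[panel$id] + 0.1 * panel$day + 2 * panel$treated +
  rnorm(nrow(panel), sd = 0.5)

# One stack per treatment date g (the last date has no later controls)
stacks <- lapply(head(treat_dates, -1), function(g) {
  next_g <- min(treat_dates[treat_dates > g])
  # cohort g plus later-treated people, only while the latter are untreated
  sub <- panel[panel$treat_day >= g & panel$day < next_g, ]
  sub$did <- as.integer(sub$treat_day == g) * as.integer(sub$day >= g)
  sub$unit_stack <- paste(sub$id, g)         # person-by-stack fixed effect
  sub$day_stack  <- paste(sub$day, g)        # day-by-stack fixed effect
  sub
})
stacked <- do.call(rbind, stacks)

fit <- lm(lhs ~ did + unit_stack + day_stack, data = stacked)
print(coef(fit)["did"])
```

Because people appear in several stacks, the default `lm` standard errors are not appropriate here; clustering by person would be needed for inference. The same stacked data could presumably be passed to `feols` with the stack-specific fixed effects and clustered standard errors.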

Relation between one dummy variable and another numerical variable

In a survey, I collected data on households in different income categories and the number of ACs they use. I want to understand whether there is a pattern between income and number of ACs, e.g. whether households with higher income use more ACs. In Excel, I assigned numerical codes to the income categories: up to USD 200 = 1, USD 201 to USD 400 = 2, and so on up to code 10. The number of ACs is also recorded as a number. NA is used for households that did not know their income or number of ACs. Using R, how can I assess the impact of income on the number of ACs?
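As a starting point, the relationship could be looked at in three ways: mean number of ACs per income code, a rank correlation (income codes are ordinal), and a Poisson regression for the count of ACs. The sketch below runs on simulated data; the column names `income_code` and `n_ac` and all numbers are assumptions, and rows with NA are dropped by the default handling of each function.

```r
set.seed(42)
n <- 300
income_code <- sample(1:10, n, replace = TRUE)          # assumed codes 1..10
n_ac <- rpois(n, lambda = exp(-1 + 0.15 * income_code)) # made-up relationship
income_code[sample(n, 20)] <- NA                        # unknown income
n_ac[sample(n, 15)] <- NA                               # unknown number of ACs
survey <- data.frame(income_code, n_ac)

# 1. Descriptive: mean number of ACs within each income code
print(aggregate(n_ac ~ income_code, data = survey, FUN = mean))

# 2. Rank correlation, suitable for an ordinal income code
sp <- cor.test(survey$income_code, survey$n_ac,
               method = "spearman", exact = FALSE)
print(sp$estimate)

# 3. Poisson regression for the count outcome
pois <- glm(n_ac ~ income_code, family = poisson, data = survey)
print(summary(pois)$coefficients)
```

Treating the income code as numeric assumes a steady change per category; using `factor(income_code)` instead would relax that at the cost of more parameters.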

Creating a counterfactual group for missing values NAs

I have a data frame with 17497 observations of 1681 variables that I am working with using R. Some variables are nominal, some are ordinal, some numeric, etc.
I am concentrating on the one that stands for net recalled salary from the previous month (dataframe$q31, where q31 simply denotes question 31 from a questionnaire). The variable is numeric.
It so happens that there are many missing values, denoted as NAs. People in managerial and professional positions tend to be more likely not to reveal their income. At the same time, they are likely to earn more. Hence, my further analysis might be distorted.
I would like to create another column with the net recalled salary where the NAs are replaced not with an average, but with the value a given person would most likely have given, preferably taking into account all other characteristics in the data frame.
If that is not possible, then at least the person's:
profession (q22isc27, ordinal)
years of experience (q24c, numeric)
age (q9age, numeric)
sex (q8, 1- men, 2 - women)
year when surveyed (pgssyear, numeric)
years of education (problematic: the q131ed variable is available for all years, but it was filled in by the surveyor and is highly approximate; it also needs to be recoded into numeric, as it somehow gets displayed as nominal in R. Since 1999, q131edr is available, which was filled in by the respondents themselves and is ordinal (in SPSS it gets displayed as "scale"))
marital status (q21, ordinal)
ownership status of the company where employed (q46e, ordinal)
hours worked per week (q21, numeric)
weight variable (weight, numeric: it depicts "representativeness" of a person in respect to the whole population) (!)
if possible also region where the respondent lives, but till 1999 there were 49 districts in Poland and afterwards 16, hence there are two variables: voiev49 and voiev16 that are coded as NAs for the invalid years.
I think it might be related to propensity score matching or to these packages that I found online: http://cran.r-project.org/web/packages/optmatch/optmatch.pdf
Is there any magical way to do it in R?
It seems I will be able to handle it using the Amelia package:
http://cran.r-project.org/web/packages/Amelia/vignettes/amelia.pdf#subsection.4.4
http://cran.r-project.org/web/packages/Amelia/Amelia.pdf
and indeed, there is a lot of material on it on Cross Validated, e.g.
https://stats.stackexchange.com/questions/95832/missing-values-nas-in-the-test-data-when-using-predict-lm-in-r
#nograpes, thank you for all the hints!
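For readers who want to see the underlying idea without a package, here is a minimal base R sketch of regression-based imputation with added noise. It is a simplified single imputation, not the multiple imputation that Amelia performs. The data are simulated; only the variable names `q31`, `q9age`, `q24c`, and `q8` come from the question, and the relationships between them are invented.

```r
set.seed(7)
n <- 500
df <- data.frame(q9age = sample(20:65, n, replace = TRUE),  # age
                 q8 = sample(1:2, n, replace = TRUE))       # sex
df$q24c <- pmax(0, df$q9age - 20 - rpois(n, 3))             # years of experience
df$q31 <- 1000 + 30 * df$q24c - 150 * (df$q8 == 2) + rnorm(n, sd = 200)
df$q31[sample(n, 100)] <- NA                                # undisclosed salaries

# Model salary from the other characteristics, using complete cases only
fit <- lm(q31 ~ q9age + q24c + factor(q8), data = df)

# Predict for the missing rows and add residual noise, so that imputed
# values are not artificially concentrated on the regression line
miss <- is.na(df$q31)
pred <- predict(fit, newdata = df[miss, ])
df$q31_imputed <- df$q31
df$q31_imputed[miss] <- pred + rnorm(sum(miss), sd = sigma(fit))

print(summary(df$q31_imputed))
```

This only corrects the distortion to the extent that non-disclosure is explained by the predictors in the model, which is why including profession matters for the salary question above.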

Independent binary variable (frequency) and continuous response variable - lmm

I've spent a lot of time searching for a solution, but without success. For that reason I decided to post my question here, hoping one of you can help me.
I want to find out which variables are influencing the travel distance of two animals (same species).
The response variable is distance moved (in meters). In total I have 66 tracking sessions for both animals.
The independent variables are: temperature, rainfall, offspring (yes = 1, no = 0), observation period (in minutes) and activity.
I observed the animals (one day - one animal) every 15 minutes and noted the state of activity (active = 1 or inactive = 0). For that reason my data table consists of around 1800 points and the same number of activity records.
Then I created a table with following columns:
Animal, Tracking-Session, rainfall, offspring, observation period, active, inactive, distance
The two columns active and inactive contain the sum of active (inactive) records per tracking session.
For example in tracking-session 1 the animal A was 30 times active and 11 inactive and moved 6000 meters during that tracking session.
I thought I could do my analysis with this table using the command cbind() to make one column for activity out of the two columns with "inactive" and "active". But this does not work, I get:
Error in lme4::lFormula(formula = distance~ (1 | animal) + activity + offspring + ...
rank of X = 12 < ncol(X) = 13
I want to include the second animal as a random factor to get an output valid for the whole "population" (which consists of only two animals in this case).
How can I fit a linear mixed model to this data? Or, as a first question: what does my data table have to look like for such an analysis?
I started by running a linear mixed model on my original data table of 1800 rows, but the outcome was not convincing, and I don't know whether this table was built correctly for the task. I have only 60 tracking sessions and therefore only 60 resulting travel distances, but 1800 records of activity (one every 15 minutes - active or inactive). I don't know how to handle this situation; the only way I found to overcome the problem was to copy the travel distance (which is the result of all points observed per day) and assign it to every single point of that tracking session.
The same goes for rainfall and temperature: because these conditions were only measured once a day, I had to copy the value to every single point taken on the same day.
Is this correct, or rather, can R handle such tables (like in the picture)? Or is it better to create a table with one row for each day (as I describe above)?
If the second table (the one with one row per tracking session) is the better choice, how does it have to be transformed so that R can use it?
Hopefully you can follow my explanations (I tried to explain it as detailed as possible) and anyone can help me!
Thanks in advance!
Iris
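Since distance is measured once per tracking session, one row per session seems the natural unit of analysis. The sketch below, on simulated data, collapses the 15-minute records to one row per session, turns the active/inactive counts into a single proportion, and fits an ordinary linear model. Note the swap: with only two animals a random intercept is hard to estimate, so `animal` enters as a fixed effect in `lm` here instead of `(1 | animal)` in a mixed model. All values and the names `prop_active` and `observation_min` are assumptions.

```r
set.seed(3)
n_sessions <- 66
n_pts <- sample(20:35, n_sessions, replace = TRUE)   # 15-min records per session

# Point-level table: one row per 15-minute observation
pts <- data.frame(session = rep(seq_len(n_sessions), times = n_pts))
pts$active <- rbinom(nrow(pts), 1, 0.6)

# Session-level table: covariates measured once per day
sessions <- data.frame(session = seq_len(n_sessions),
                       animal = rep(c("A", "B"), each = n_sessions / 2),
                       temperature = rnorm(n_sessions, 20, 4),
                       rainfall = rexp(n_sessions, 0.5),
                       offspring = rbinom(n_sessions, 1, 0.4))

# Collapse activity records to one summary per session
sessions$active <- as.vector(tapply(pts$active, pts$session, sum))
sessions$inactive <- n_pts - sessions$active
sessions$observation_min <- 15 * n_pts
sessions$prop_active <- sessions$active / (sessions$active + sessions$inactive)

# Made-up outcome so that the model has something to recover
sessions$distance <- 2000 + 5000 * sessions$prop_active +
  3 * sessions$observation_min + rnorm(n_sessions, sd = 500)

fit <- lm(distance ~ prop_active + temperature + rainfall + offspring +
            observation_min + animal, data = sessions)
print(summary(fit)$coefficients)
```

Using one proportion instead of both count columns avoids the redundancy between `active` and `inactive`, which is one plausible source of a rank-deficiency error like the one quoted above. Copying the session distance onto all point-level rows would overstate the amount of independent information.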
