Relation between one dummy variable and another numerical variable - r

In a survey, I collected data on households in different income categories and the number of ACs they use. I want to understand whether there is any pattern between income and the number of ACs, for example whether households with more income use more ACs. In Excel, I have given numerical codes to the income categories: up to USD200 is 1, USD201 to USD400 is 2, and so on up to code 10. The number of ACs is also numeric. NA has been entered for households that did not know their income or number of ACs. Using R, how do I assess the impact of income on the number of ACs?
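One possible way to look at this, sketched on simulated data because the real survey is not shown: compare mean AC counts across income codes, compute a rank correlation, and fit a Poisson regression. The column names `income_code` and `n_ac` and the file name `survey.csv` are assumptions, and treating the 1 to 10 code as a linear trend is a simplification.

```r
# Simulated stand-in for the survey data; column names are hypothetical.
set.seed(1)
survey <- data.frame(
  income_code = sample(c(1:10, NA), 200, replace = TRUE),
  n_ac        = rpois(200, lambda = 1.5)
)
survey$n_ac[sample(200, 10)] <- NA  # some households did not report ACs

# If the Excel "NA" cells arrive as text, they can be read as missing, e.g.
# survey <- read.csv("survey.csv", na.strings = "NA")   # hypothetical file

# Keep only households with both values present
ok <- complete.cases(survey[, c("income_code", "n_ac")])
d  <- survey[ok, ]

# Mean number of ACs per income category
ac_by_income <- tapply(d$n_ac, d$income_code, mean)
print(ac_by_income)

# Rank-based association, suitable for an ordered category code
rho <- cor(d$income_code, d$n_ac, method = "spearman")
print(rho)

# Poisson regression of the AC count on the income code (as a linear trend)
fit <- glm(n_ac ~ income_code, data = d, family = poisson)
print(summary(fit))
```

Replacing `income_code` with `factor(income_code)` in the formula would estimate a separate effect per category instead of a single trend.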

Related

How to create a frequency table in R

I have repeated measures of BMI for over 50K people, who had BMI measurements taken at different timepoints. What I want to know is the number of people who have only 1 measurement, only 2 measurements, and so on. How do I get this in a table with cumulative frequency, like PROC FREQ in SAS?
I don't know how to get those mutually exclusive counts in a table format.
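A minimal base R sketch of such a table, assuming long-format data with one row per measurement; the data frame `bmi` and its columns `id` and `bmi` are made up for illustration. Counting rows per person and then tabulating those counts gives the mutually exclusive groups.

```r
# Hypothetical long-format data: one row per BMI measurement
bmi <- data.frame(
  id  = c(1, 2, 2, 3, 3, 3, 4, 5, 5, 6),
  bmi = c(22.1, 30.4, 29.8, 25.0, 25.3, 24.9, 27.7, 31.2, 30.9, 19.5)
)

# Number of measurements per person
n_per_person <- table(bmi$id)

# Number of people with exactly 1, 2, 3, ... measurements
freq <- table(as.vector(n_per_person))

# Frequency table with cumulative columns, similar in spirit to PROC FREQ
freq_table <- data.frame(
  n_measurements = as.integer(names(freq)),
  frequency      = as.vector(freq),
  percent        = as.vector(100 * prop.table(freq)),
  cum_frequency  = cumsum(as.vector(freq)),
  cum_percent    = cumsum(as.vector(100 * prop.table(freq)))
)
print(freq_table)
```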

GEE model appropriate - sample size calculation

I'm interested in the sample size calculation (preferably in R) for a study with the following properties:
Patients currently under cancer-specific therapy should be asked weekly (for up to one year [52 times]; alternatively up to 6 months [26 times]) about infection-related symptoms (e.g. fever, pain, ...). These symptoms are assessed with a 15-item self-report questionnaire (4 levels).
Patients will come to the clinic regularly or when an acute situation arises. There, objective parameters (e.g. blood samples) are collected, which allow a statement to be made as to whether an infection is or was present (variable infection: yes/no).
It is expected that, on average, patients report about 3-5 infections during cancer-specific therapy within one year.
Example scenario:
Patient X answers the 15 item questionnaire each week: In week 5 he/she reports having fever, which might be an indicator for an ongoing infection. In week 6 the patient comes to a regular visit to the hospital. The collected blood sample verifies that the patient had an infection the week before.
In week 14 the patient reports in the questionnaire having some pain in the chest; he/she visits the physician to check whether there is an infection. Blood samples do not confirm an infection.
Statistical analyses:
Given these scenarios, I am interested in how much information each of the 15 items provides for correctly predicting whether an infection is about to occur or is already ongoing.
I think a Generalized Estimating Equations (GEE) model might be appropriate in this scenario (compared to ROC/AUC analyses), since I have to account for multiple measurements within each patient and for the fact that some infections can last longer than one week.
This means for example if patient Y has an infection ranging from week 5 to 6 it should not be counted as two separate infections.
Questions
Is a GEE model appropriate to answer whether one or more of the 15 questionnaire items is predictive of an infection that is about to occur or is already ongoing?
Is it possible to differentiate between the predictive value of these items?
How can I conduct a sample size calculation for the scenario described above (preferably in R)?
Thank you very much!
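The requirement that an infection spanning weeks 5 to 6 counts once can be handled before any modelling. The sketch below assumes a hypothetical weekly 0/1 indicator for a single patient and uses base R's `rle()` to collapse consecutive infection weeks into episodes. It does not address the GEE fit or the sample size calculation.

```r
# Hypothetical weekly infection indicator for one patient (1 = infection confirmed)
infection <- c(0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0)

# Run-length encoding groups consecutive identical values into runs
runs <- rle(infection)

# Each run of 1s is one episode, however many weeks it lasts
n_episodes <- sum(runs$values == 1)

# Start week and duration of every episode
run_end   <- cumsum(runs$lengths)
run_start <- run_end - runs$lengths + 1
episodes <- data.frame(
  start_week = run_start[runs$values == 1],
  n_weeks    = runs$lengths[runs$values == 1]
)
print(n_episodes)
print(episodes)
```

Applied per patient, for example with `split()` and `lapply()`, this would give episode counts that could be checked against the expected 3-5 infections per year.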

How do I create a different data set from existing data set with only certain variables and values that I need?

I have a data set with the age of chicks (bird chicks) from day 2 to day 10 (2, 4, 6, 8, 10), and a mass measurement for each chick on days 2, 4, 6, 8 and 10. But not all chicks survive until day 10. How do I extract a data set in R from the overall data set that contains only those individuals that have a mass value for each of those days? I would also like to sort them by Mass and Tarsus, and to get a data set of those individuals that have values for both variables on those days.
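A minimal base R sketch, assuming the data are in long format with one row per chick per day; the data frame `chicks` and the columns `chick_id`, `day`, `mass` and `tarsus` are hypothetical names. It keeps only chicks with both mass and tarsus recorded on every required day, then sorts the result.

```r
# Hypothetical long-format data: one row per chick per measurement day
chicks <- data.frame(
  chick_id = rep(c("A", "B", "C"), each = 5),
  day      = rep(c(2, 4, 6, 8, 10), times = 3),
  mass     = c(5, 9, 14, 20, 26,   6, 10, 15, NA, NA,   4, 8, 13, 19, 24),
  tarsus   = c(8, 11, 14, 17, 19,  9, 12, 15, NA, NA,   7, 10, NA, 16, 18)
)

days_needed <- c(2, 4, 6, 8, 10)

# Rows where both mass and tarsus were recorded
measured <- chicks[!is.na(chicks$mass) & !is.na(chicks$tarsus), ]

# Number of required days each chick was measured on
# (assumes at most one row per chick per day)
n_days <- tapply(measured$day %in% days_needed, measured$chick_id, sum)

# Chicks measured on every required day
complete_ids <- names(n_days)[n_days == length(days_needed)]

# Subset the full data and sort by mass, then tarsus
complete_chicks <- chicks[chicks$chick_id %in% complete_ids, ]
complete_chicks <- complete_chicks[order(complete_chicks$mass,
                                         complete_chicks$tarsus), ]
print(complete_chicks)
```

Dropping the `tarsus` condition from `measured` would give the mass-only version of the subset.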

Delphi study analysis, how to calculate outcome specific stakeholder score frequency

I am analysing data from a Delphi study and I need to create a vector of the frequency of each score (1:10) for each stakeholder group (6 groups, total of 73 participants) for each outcome (48). The data is in the form:
I would like to create a vector similar to:
score 1,2,3,4,5,6,7,8,9
trialists<-c(0,0,0,0,28.6,71.4,0,0,0)
Here each value is the percentage of a stakeholder group (e.g. trialists) that gave each score for each outcome. I need to exclude a score of 10 as it represents "unable to answer".
This will result in 48 vectors for each of the 6 stakeholder groups.
Is there an elegant way to do this in R rather than plodding through the data in Excel and inputting it manually?
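Since the actual data layout is not shown, the following sketch assumes a long format with one row per participant per outcome; the data frame `delphi` and the columns `group`, `outcome` and `score` are hypothetical. A three-way `table()` combined with `prop.table()` yields all group-by-outcome percentage vectors at once.

```r
# Hypothetical long-format data: one row per participant per outcome
delphi <- data.frame(
  group   = c(rep("trialists", 7), rep("patients", 4)),
  outcome = "outcome_1",
  score   = c(5, 5, 6, 6, 6, 6, 6,   7, 8, 8, 10)
)

# Drop score 10 ("unable to answer") before computing percentages
answered <- delphi[delphi$score != 10, ]

# Fix the levels so every score 1..9 appears, even with a zero count
answered$score <- factor(answered$score, levels = 1:9)

# Counts by score x group x outcome
counts <- table(answered$score, answered$group, answered$outcome)

# Percentage within each group and outcome (margins 2 and 3)
pct <- round(100 * prop.table(counts, margin = c(2, 3)), 1)

# One vector for one group and one outcome
trialists <- as.vector(pct[, "trialists", "outcome_1"])
print(trialists)
```

With all 48 outcomes present in `delphi`, the array `pct` would hold every vector, indexed by group and outcome name.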

Creating a counterfactual group for missing values NAs

I have a data frame with 17497 observations of 1681 variables that I am working with using R. Some variables are nominal, some are ordinal, some numeric, etc.
I am concentrating on the one that stands for net recalled salary from the previous month (dataframe$q31, where q31 simply denotes question 31 from a questionnaire). The variable is numeric.
It so happens that there are many missing values, denoted as NAs. People in managerial and professional positions tend to be more likely not to reveal their income. At the same time, they are more likely to earn more. Hence, my further analysis might be distorted.
I would like to create another column with the net recalled salary where the NAs are replaced not with an average, but with the number a given person would be most likely to give, preferably taking into consideration all other characteristics from the dataframe.
If that is not possible, then at least these:
profession (q22isc27, ordinal)
years of experience (q24c, numeric)
age (q9age, numeric)
sex (q8, 1- men, 2 - women)
year when surveyed (pgssyear, numeric)
years of education (problematic: for all years the q131ed variable is available, but it was filled in by the surveyor and is highly approximate; additionally it needs to be recoded into numeric, as it somehow gets displayed as nominal in R; since 1999 q131edr is available, which was filled in by the respondents themselves and is ordinal (in SPSS it is displayed as "scale"))
marital status (q21, ordinal)
ownership status of the company where employed (q46e, ordinal)
hours worked per week (q21, numeric)
weight variable (weight, numeric: it depicts "representativeness" of a person in respect to the whole population) (!)
if possible also region where the respondent lives, but till 1999 there were 49 districts in Poland and afterwards 16, hence there are two variables: voiev49 and voiev16 that are coded as NAs for the invalid years.
I think it might be related to propensity score matching or to these packages that I found online: http://cran.r-project.org/web/packages/optmatch/optmatch.pdf
Is there any magical way to do it in R?
It seems I will be able to handle it using the Amelia package:
http://cran.r-project.org/web/packages/Amelia/vignettes/amelia.pdf#subsection.4.4
http://cran.r-project.org/web/packages/Amelia/Amelia.pdf
and indeed, there is a lot of material on it on Cross Validated, e.g.
https://stats.stackexchange.com/questions/95832/missing-values-nas-in-the-test-data-when-using-predict-lm-in-r
#nograpes, thank you for all the hints!
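For illustration only, here is a much simpler technique than Amelia's multiple imputation: single regression imputation in base R, where missing salaries are replaced by predictions from a linear model. The data are simulated and mimic only a few of the variables listed above (`q31`, `q9age`, `q8`, `q24c`); the coefficients used to generate them are arbitrary.

```r
# Simulated stand-in; only a few of the listed predictors are mimicked
set.seed(42)
n <- 500
df <- data.frame(
  q9age = round(runif(n, 20, 65)),          # age
  q8    = sample(1:2, n, replace = TRUE),   # sex: 1 = men, 2 = women
  q24c  = round(runif(n, 0, 40))            # years of experience
)
df$q31 <- 1000 + 20 * df$q24c + 5 * df$q9age - 150 * (df$q8 == 2) +
  rnorm(n, sd = 100)
df$q31[sample(n, 100)] <- NA                # salaries that were not reported

# Fit on respondents who reported a salary (lm drops NA rows by default)
fit <- lm(q31 ~ q9age + factor(q8) + q24c, data = df)

# New column: observed salary where available, model prediction otherwise
missing <- is.na(df$q31)
df$q31_imputed <- df$q31
df$q31_imputed[missing] <- predict(fit, newdata = df[missing, ])

print(summary(df$q31_imputed))
```

A single predicted value per person understates the uncertainty of the imputed salaries, which is the problem multiple imputation is designed to address. Survey weights could be supplied through the `weights` argument of `lm()`.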
