Creating a counterfactual group for missing values NAs - r

I have a data frame with 17497 observations of 1681 variables that I am working with using R. Some variables are nominal, some are ordinal, some numeric, etc.
I am concentrating on the one that stands for net recalled salary from the previous month (dataframe$q31, where q31 simply denotes question 31 from a questionnaire). The variable is numeric.
It so happens that there are many missing values, coded as NA. People in managerial and professional positions tend to be less willing to reveal their income. At the same time, they are likely to earn more. Hence, my further analysis might be distorted.
I would like to create another column with the net recalled salary in which the NAs are replaced not with an average, but with the value a given person would most likely have reported, taking into consideration, preferably, all other characteristics in the data frame.
If that is not possible, then at least the respondent's:
profession (q22isc27, ordinal)
years of experience (q24c, numeric)
age (q9age, numeric)
sex (q8, 1- men, 2 - women)
year when surveyed (pgssyear, numeric)
years of education (problematic: q131ed is available for all years, but it was filled in by the interviewer and is highly approximate; it also needs to be recoded to numeric, as R somehow displays it as nominal. Since 1999, q131edr is available, which was filled in by the respondents themselves and is ordinal; SPSS displays it as "scale")
marital status (q21, ordinal)
ownership status of the company where employed (q46e, ordinal)
hours worked per week (q21, numeric)
weight variable (weight, numeric: it depicts "representativeness" of a person in respect to the whole population) (!)
if possible also region where the respondent lives, but till 1999 there were 49 districts in Poland and afterwards 16, hence there are two variables: voiev49 and voiev16 that are coded as NAs for the invalid years.
I think it might be related to propensity score matching or to these packages that I found online: http://cran.r-project.org/web/packages/optmatch/optmatch.pdf
Is there any magical way to do it in R?

It seems I will be able to handle it using the Amelia package:
http://cran.r-project.org/web/packages/Amelia/vignettes/amelia.pdf#subsection.4.4
http://cran.r-project.org/web/packages/Amelia/Amelia.pdf
And true, there is a lot of material on it on Cross Validated, e.g.
https://stats.stackexchange.com/questions/95832/missing-values-nas-in-the-test-data-when-using-predict-lm-in-r
@nograpes, thank you for all the hints!
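For readers who want to see the basic idea before turning to Amelia, here is a minimal base-R sketch of single regression imputation. It is not what Amelia does (Amelia performs multiple imputation); it only illustrates filling the NAs in q31 with model predictions. The toy data are made up, q31_imputed is a hypothetical column name, and only three of the predictors listed in the question are used.

```r
# Toy data: column names mirror the question, values are invented.
set.seed(1)
n <- 200
dataframe <- data.frame(
  q9age = sample(20:64, n, replace = TRUE),
  q8    = factor(sample(1:2, n, replace = TRUE), levels = 1:2,
                 labels = c("men", "women")),
  q24c  = sample(0:40, n, replace = TRUE)
)
dataframe$q31 <- 1000 + 20 * dataframe$q9age + 30 * dataframe$q24c +
  rnorm(n, sd = 200)
dataframe$q31[sample(n, 40)] <- NA  # pretend 40 people refused to answer

# Fit on respondents who reported a salary (lm drops NA rows by default).
fit <- lm(q31 ~ q9age + q8 + q24c, data = dataframe)

# New column: observed salary where available, model prediction otherwise.
missing <- is.na(dataframe$q31)
dataframe$q31_imputed <- dataframe$q31
dataframe$q31_imputed[missing] <- predict(fit, newdata = dataframe[missing, ])
```

A single predicted value understates the uncertainty of the imputed salaries, which is the reason multiple-imputation packages exist. If the survey weights should influence the fit, lm() accepts a weights argument.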

Related

How do I count instances of a categorical variable for each instance of another categorical variable?

Disclaimer: I can't include data because it's confidential student data.
I have an R dataframe "data" with a column "StateResidence" for the state the student is from, and a column "Enrolled" containing 0 or 1 that tells whether or not they enrolled in the school I go to.
I'm trying to make a dataframe with three columns: Column 1 should list out each of the 69 unique States listed in Data (I've already done this one), and column 2 should show how many students from that state are enrolled, and column 3 should show what percentage of the total students from that state were enrolled.
The reason for this is so I can do some exploratory data analysis by plotting barplots with the number on the Y axis and the state on the X axis to analyze enrollment trends geographically.
I really don't have much else to include - I'm completely lost here, and I'm not very familiar with R. Any help is greatly appreciated, even just some helpful functions or something. Thank you.
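One possible approach, sketched here with made-up rows since the real data are confidential, is to use tapply() to sum and average the 0/1 column within each state. The column names StateResidence and Enrolled come from the question; the result column names are hypothetical.

```r
# Invented stand-in for the confidential data.
data <- data.frame(
  StateResidence = c("NY", "NY", "CA", "CA", "CA", "TX"),
  Enrolled       = c(1, 0, 1, 1, 0, 0)
)

# Sum of a 0/1 column = number enrolled; mean of it = share enrolled.
n_enrolled   <- tapply(data$Enrolled, data$StateResidence, sum)
pct_enrolled <- tapply(data$Enrolled, data$StateResidence, mean) * 100

result <- data.frame(
  State       = names(n_enrolled),
  NumEnrolled = as.vector(n_enrolled),
  PctEnrolled = as.vector(pct_enrolled)
)
```

The result can then be plotted with, for example, barplot(result$NumEnrolled, names.arg = result$State).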

Relation between one dummy variable and another numerical variable

In a survey, I collected data on households in different income categories and the number of ACs they use. I want to find out whether there is a pattern between income and number of ACs, for example whether households with higher income use more ACs. In Excel, I have assigned numerical codes to the income categories: 1 for up to USD 200, 2 for USD 201 to USD 400, and so on up to code 10. The number of ACs is also recorded as a number. NA is entered for households that did not know their income or number of ACs. Using R, how can I assess the effect of income on the number of ACs?
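A first look could be sketched as follows, assuming the Excel sheet has been read into a data frame with two columns (the names income_code and n_ac are hypothetical, and the values are invented). Because the income codes are ordered categories rather than real amounts, a rank-based (Spearman) correlation is used here instead of Pearson's.

```r
# Invented survey rows; NA marks households that did not know the answer.
survey <- data.frame(
  income_code = c(1, 2, 2, 3, 5, 7, NA, 10, 4, 8),
  n_ac        = c(0, 1, 0, 1, 2, 3, 1, 4, NA, 3)
)

# Keep only households with both values present.
complete <- survey[complete.cases(survey), ]

# Average number of ACs within each income category.
mean_ac_by_income <- tapply(complete$n_ac, complete$income_code, mean)

# Rank correlation between income category and number of ACs.
rho <- cor(complete$income_code, complete$n_ac, method = "spearman")
```

A positive rho suggests that higher income categories tend to go with more ACs; boxplot(n_ac ~ income_code, data = complete) gives a visual version of the same comparison.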

How to normalize data with semesters vs trimesters when all mixed in dataset in R

I am new to R but finding it a powerful solution when working with education data across a state. I have grades for about 11,000 students over the span of two years. Most students have 12 rows in my dataset, as most schools work on a semester system. Many schools, however, work on a trimester or quarter system, meaning there are more or fewer rows and, therefore, more or fewer grades. The grades are relatively close throughout each semester/trimester/quarter, and I have already converted the letter grades into numeric values. A column titled 'TERM' identifies which system the school is under (SEM1/2, TRI1/2/3, QTR1/2/3/4). I am wondering if anyone has an idea as to how best to organize this data by TERM so I have something normalized.
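# data.frame() keeps the grades numeric (cbind() would coerce everything to character); column names other than TERM are illustrative
df <- data.frame(student = c('stu1', 'stu1', 'stu2', 'stu2', 'stu2'), TERM = c('sem1', 'sem2', 'tri1', 'tri2', 'tri3'), letter = c('a', 'c', 'a', 'b', 'a'), grade = c(4, 2, 4, 3, 4))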
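One way to put the different term systems on a common footing, offered only as a sketch, is to split TERM into the system and the term number, express each term as a fraction of the school year, and average grades per student. The column names student, grade, system, term_no and year_fraction are assumptions, not part of the original data.

```r
# Sample rows in the spirit of the question, stored as a data frame.
df <- data.frame(
  student = c("stu1", "stu1", "stu2", "stu2", "stu2"),
  TERM    = c("sem1", "sem2", "tri1", "tri2", "tri3"),
  grade   = c(4, 2, 4, 3, 4)
)

# Split "sem1" into the system ("sem") and the term number (1).
term <- tolower(df$TERM)
df$system  <- sub("[0-9]+$", "", term)
df$term_no <- as.integer(sub("^[a-z]+", "", term))

# Position of each term within the school year, comparable across systems.
terms_per_year <- c(sem = 2, tri = 3, qtr = 4)
df$year_fraction <- unname(df$term_no / terms_per_year[df$system])

# One normalized value per student: the mean grade over all terms.
yearly <- aggregate(grade ~ student, data = df, FUN = mean)
```

Averaging per student treats every term equally within a student; whether that is appropriate depends on what the analysis is meant to compare.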

Transforming only part of the NA's of a variable in 0

I am working with a dataframe in RStudio, trying to understand whether there is a correlation between exercising and a person's general health. There are three main variables:
exerof1: how many times the people in the study exercised in the last 30 days.
exerany2: whether the participants exercised in the last month; they could answer yes, no, or refuse to answer.
genhlth: a factor variable that splits the observations into 5 levels.
I have already transformed the exerof1 variable, but 30% of its values are NAs, and most of them are NAs because the person answered "No" to the "exerany2" question.
My objective is to identify the people who said "No" in the "exerany2" variable and are listed in exerof1 as NA, and to change those NAs to 0.
I don't know if my analysis is the best way, because I am a beginner. I tried to do what I want using ifelse, but I am struggling. I also tried to check whether there is another thread with the same question, but I couldn't find one.
I await your feedback.
Assuming your data frame is called data:
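data[is.na(data$exerof1) & data$exerany2 %in% "No", "exerof1"] <- 0
Basically, we select the rows that satisfy your condition, then pick the column exerof1, and assign those cells the value 0. Using %in% rather than == keeps the row index free of NAs when exerany2 itself is missing; a data frame assignment with NA in the row index fails with an error.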
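A small made-up example shows the effect, including a respondent whose exerany2 is itself missing. This sketch uses which(), which drops NA comparisons from the row index, as one way to keep the assignment safe; the values are invented.

```r
# Invented rows: one real count, two "No" with NA, one refusal, one odd "Yes".
data <- data.frame(
  exerany2 = c("Yes", "No", "No", NA, "Yes"),
  exerof1  = c(12, NA, NA, NA, NA)
)

# Rows where the count is missing AND the person said "No".
# which() keeps only TRUE positions, so NA comparisons are ignored.
rows <- which(is.na(data$exerof1) & data$exerany2 == "No")
data$exerof1[rows] <- 0
```

After this, exerof1 is 12, 0, 0, NA, NA: only the NAs explained by a "No" answer become 0, while the others stay missing.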

Should year variable be factor or numeric in panel data in R?

I have a panel dataset where hospitals are followed over time from 2004 to 2010 every two years. The data is in Stata, but I bring it into R. Initially the variables year (2004, 2006, 2008, 2010) and t (1=2004, 2=2006 and so on) are integers, but later I convert them into factors as follows:
data$year <- factor(data$year)
and similarly for t time variable as well.
But I am confused: should year or t be treated as an integer or numeric variable, or converted to a factor for panel data? And is the above command the right way to convert to a factor?
Treating year as a categorical variable estimates a separate effect for each individual year, i.e. how much the target variable differed on average in a given year. On the other hand, including t as a numeric variable estimates the average change from one survey wave to the next, that is, over two years. Given that there are just 4 time periods, the first approach seems more reasonable, but it really depends on the goal of your analysis.
The command can also be written as
data$year <- as.factor(data$year)
Your factor(data$year) call works as well; both produce a factor with the same levels.
Also, make sure that you include only one of year or t. Since t is fully determined by year, including both would make the coefficients impossible to interpret.
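The difference between the two codings can be seen in a small simulated panel. This is only a sketch: the outcome y and the data are invented, and plain lm() is used instead of a dedicated panel estimator.

```r
# Simulated panel: 30 hospitals observed in four survey years.
set.seed(42)
years <- c(2004, 2006, 2008, 2010)
data <- data.frame(
  hospital = rep(1:30, each = 4),
  year     = rep(years, times = 30)
)
data$t <- match(data$year, years)                 # 1, 2, 3, 4
data$y <- 10 + 2 * data$t + rnorm(nrow(data))     # invented outcome

# Year as factor: intercept plus one coefficient per non-baseline year.
fit_factor <- lm(y ~ factor(year), data = data)

# t as numeric: intercept plus a single slope per survey wave.
fit_trend <- lm(y ~ t, data = data)

n_coef_factor <- length(coef(fit_factor))  # 4 coefficients
n_coef_trend  <- length(coef(fit_trend))   # 2 coefficients
```

The factor version lets each year have its own level, while the numeric version forces the change between waves to be the same every time.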
