Dealing with Missing Values for one Variable in R

I'm currently dealing with a data set that has missing values, but they are only missing for one single variable. I was trying to determine whether they are missing at random, so that I can simply remove them from the data frame. Hence, I am trying to find potential correlations between the NA's in the data frame and the values of the other variables. I found the following code online:
library("VIM")
data(sleep)
x <- as.data.frame(abs(is.na(sleep)))
head(sleep)
head(x)
y <- x[which(sapply(x, sd) > 0)]
cor(y)
However, this only shows how the missing values themselves are correlated with each other, which is only informative when they are distributed across several variables.
Is there a way to find not the correlation between the missing values in a data frame, but the correlation between the missing values of one variable and the values of another variable? For example, if a survey optionally asks for family income, how could you determine in R whether the missing values are correlated with, say, low income?

library(finalfit)
library(dplyr)

df <- data.frame(
  A = c(1, 2, 4, 5),
  B = c(55, 44, 3, 6),
  C = c(NA, 4, NA, 5)
)

# missing_pairs() plots each variable's distribution split by whether
# the other variable is missing or observed
df %>%
  missing_pairs("A", "C")
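If you prefer a numeric check rather than a plot, one option is to build a missingness indicator for the variable of interest and relate it directly to the other variables. A minimal sketch using the VIM sleep data from the question (the choice of NonD and BodyWgt is purely illustrative):
library(VIM)
data(sleep)

# 1 where NonD is missing, 0 where it is observed
miss_ind <- as.integer(is.na(sleep$NonD))

# Point-biserial correlation between the missingness indicator and another variable
cor(miss_ind, sleep$BodyWgt)

# Or compare the other variable's distribution for missing vs. observed NonD
t.test(sleep$BodyWgt ~ miss_ind)
A small p-value from the t-test suggests the missingness is related to that variable, i.e. the values are probably not missing completely at random.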

Related

Extract p values and r values for all pairwise variables

I have multiple variables for multiple countries over multiple years. I would like to generate a dataframe containing both an R^2 value and a P value for each pair of variables. I'm somewhat close, have a minimum working example and an idea of what the end product should look like, but am having some difficulties actually implementing it. If anyone could help, that would be most appreciated.
Please note, I would like to do this more manually rather than using packages like Hmisc, as that has created a number of other issues. I've had a look around for similar solutions as well, but haven't had much luck.
# Code to generate minimum working example (country year pairs).
library(tidyindexR)
library(tidyverse)
library(dplyr)
library(reshape2)
# Function to generate minimum working example data
simulateCountryData = function(N=200, NEACH = 20, SEED=100){
  set.seed(SEED)  # use the SEED argument so the example is reproducible
  variableOne <- rnorm(N, sample(1:100, NEACH), 0.5)
  variableOne[variableOne < 0] <- 0
  variableTwo <- rnorm(N, sample(1:100, NEACH), 0.5)
  variableTwo[variableTwo < 0] <- 0
  variableThree <- rnorm(N, sample(1:100, NEACH), 0.5)
  variableThree[variableThree < 0] <- 0
  geocodeNum <- factor(rep(seq(1, N/NEACH), each = NEACH))
  year <- rep(seq(2000, 2000 + NEACH - 1, 1), N/NEACH)
  # Putting it all together
  AllData <- data.frame(geocodeNum,
                        year,
                        variableOne,
                        variableTwo,
                        variableThree)
  return(AllData)
}
# This runs the function and generates the data
mySimData = simulateCountryData()
I have a reasonable idea of how to get correlations (both p values and r values) between 2 manually selected variables, but am having some trouble implementing it on the entire dataset and on a country level (rather than all at once).
# Example p value
corrP = cor.test(spreadMySimData$variableOne, spreadMySimData$variableTwo)$p.value
# Example r value
corrEst = cor(spreadMySimData$variableOne, spreadMySimData$variableTwo)
Finally, the end result should look something like this:
myVariables = colnames(spreadMySimData[3:ncol(spreadMySimData)])
myMatrix = expand.grid(myVariables,myVariables)
# I'm having trouble actually trying to get the r values and p values in the dataframe
myMatrix = as.data.frame(myMatrix)
myMatrix$Pval = runif(9,0.01,1)
myMatrix$Rval = runif(9,0.2,1)
myMatrix
Thanks again :)
This will compute r and p for all the unique pairs.
# matrix of unique pairs coded as numeric
mx_combos <- combn(1:length(myVariables), 2)
# list of unique pairs coded as numeric
ls_combos <- split(mx_combos, rep(1:ncol(mx_combos), each = nrow(mx_combos)))
# for each pair in the list, create a 1 x 4 dataframe
ls_rows <- lapply(ls_combos, function(p) {
  # lookup names of variables
  v1 <- myVariables[p[1]]
  v2 <- myVariables[p[2]]
  # perform the cor.test()
  htest <- cor.test(mySimData[[v1]], mySimData[[v2]])
  # record pertinent info in a dataframe
  data.frame(Var1 = v1,
             Var2 = v2,
             Pval = htest$p.value,
             Rval = unname(htest$estimate))
})
# row bind the list of dataframes
dplyr::bind_rows(ls_rows)
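The question also asks for these correlations at the country level rather than across the whole data set. A minimal sketch of one way to extend the above (the helper pairwise_cor below is not part of the original answer; it assumes the simulated mySimData, with countries identified by geocodeNum):
# Hypothetical helper: pairwise r and p values within one data frame
myVariables <- c("variableOne", "variableTwo", "variableThree")

pairwise_cor <- function(dat) {
  combos <- combn(myVariables, 2)
  do.call(rbind, lapply(seq_len(ncol(combos)), function(i) {
    v1 <- combos[1, i]
    v2 <- combos[2, i]
    htest <- cor.test(dat[[v1]], dat[[v2]])
    data.frame(Var1 = v1, Var2 = v2,
               Pval = htest$p.value,
               Rval = unname(htest$estimate))
  }))
}

# Apply it separately to each country and bind the results
by_country <- lapply(split(mySimData, mySimData$geocodeNum), pairwise_cor)
dplyr::bind_rows(by_country, .id = "geocodeNum")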

Unbalanced ANOVA, R possibly disregarding duplicates

After running a one-way ANOVA on my dataset, I noticed that it's reporting the results as unbalanced despite there being equal numbers of entries for every variable.
Then, using ezPrecis to look at the data frame, it seems that some values are not being counted despite the correct number of rows being registered. For example, using just method C from id 1, it says there are 46 values in ct even though it registers 50 rows (and has 50 values under ct). Is it possible that R is disregarding the duplicate values? Looking at the raw file, there are four 400s and two 1684s; if you eliminate the duplicates, that is precisely 4 items not counted, which lines up with the 46 ct values reported by ezPrecis. Is this why the ANOVA is unbalanced? If so, how do you fix it?
library(ez)
library(dplyr)     # for %>%, group_by and summarise
library(forcats)   # for as_factor

data1 <- read.csv("data.csv")
data1
data1$id <- as.character(data1$id)
data1$id <- as_factor(data1$id)
data1$method <- as_factor(data1$method)
ezPrecis(data1)
ezDesign(data=data1, x=method, y=id)
data2 <- data1 %>%
  group_by(method) %>%
  summarise(mean = mean(ct, na.rm = TRUE),
            sd = sd(ct, na.rm = TRUE),
            se = sd(ct)/sqrt(length(ct)))
data2
data2anova <- ezANOVA(data=data1, dv=ct, wid=id, within=.(method), type=3,
                      detailed=TRUE, return_aov=TRUE)
data2anova
Raw data: https://ufile.io/cfe1w
All the rows are used in the ANOVA. The ezPrecis function informs you of the number of unique values in the column. This is clear from the help for the function where it refers to the "values" column as "unique." Why that column name got changed to "values" is anyone's guess.
The output from the ANOVA says "Estimated effects may be unbalanced". The number of replications for each factor, as computed by the replications function, is likely being inspected during the aov processing, and the warning tells you that the design may be unbalanced.
Your data frame yields the following:
replications(~ . - ct, data=data1)
id method testblock trial
150 100 60 30
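To confirm that every row is used, and to see where the smaller counts come from, you can compare the number of rows with the number of distinct ct values in each id/method cell. A minimal sketch, assuming data1 as loaded above and dplyr attached:
# Rows per cell (what the ANOVA uses) versus distinct ct values per cell
# (what ezPrecis reports)
data1 %>%
  group_by(id, method) %>%
  summarise(n_rows = n(),
            n_unique_ct = n_distinct(ct),
            .groups = "drop")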

Binning an unevenly distributed column in R

I have a column in R with an uneven distribution, something like an exponential distribution. I want to normalize the data and then bin it into buckets.
I found the following links, which help with normalizing the data, but nothing on binning the data into different categories.
Normalizing data in R
Standardize data columns in R
Example of how an unevenly distributed column would look, but with many more rows:
dat <- data.frame(Id = c(1,2,3,4,5,6,7,8),
                  Qty = c(1,1,1,2,3,13,30,45))
I want the column binned into 5 categories, which may look like:
dat <- data.frame(Id = c(1,2,3,4,5,6,7,8),
                  Qty = c(1,1,1,2,3,13,30,45),
                  Binned_Category = c(1,1,1,1,2,3,4,5))
The Binned_Category above is just a sample; the values may not look like this for real-world data. I only wanted to show what I want the output to look like.
This will help:
num_bins <- 5
findInterval(dat$Qty, unique(quantile(dat$Qty, prob = seq(0, 1, 1/num_bins))))
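For completeness, a minimal sketch of attaching the bins to the example data frame from the question (the column name Binned_Category is taken from the desired output shown above):
num_bins <- 5
# Quantile-based break points; unique() guards against duplicated quantiles
breaks <- unique(quantile(dat$Qty, probs = seq(0, 1, 1/num_bins)))
dat$Binned_Category <- findInterval(dat$Qty, breaks)
dat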

In R, how would I remove strata from 3-dimensional contingency tables depending on the number of observations in the strata?

I'm building 3d contingency tables from 3 variables in a data frame. Let's suppose I'm constructing these via
table(x,y,z)
Where z is the variable on which I'm stratifying. I'd like to get rid of any (,,z(i)) where the number of observations in that stratum is 1.
How might I do this? I had trouble figuring out how to count observations in the first place, which I thought I'd be able to use, with subset, to pare down my contingency tables.
Supposing your data is contained in a data frame object named data, this code should remove all data in strata with one observation.
data <- data[-which(data$z %in% which(table(data$z)==1)),]
EDIT
This appears to work now. I'm not sure if this will work in general, but it works for this situation.
data <- read.csv(file='~/Downloads/juveniles2forMax.csv')
data <- data.frame(
  Urban = data$Urban,
  RecidivismPlacement = data$RecidivismPlacement,
  timeinjj = data$timeinjj
)

removeStrata <- function(data, z) {
  data[-which(data[,z] %in% as.numeric(attr(which(table(data[,z])==1),"names"))),]
}
removeStrata(data=data, z='timeinjj')
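A slightly more general variant of the same idea (a sketch, not part of the original answer) avoids converting stratum names with as.numeric(), so it also works for character or factor strata, and avoids the -which() pitfall of dropping every row when no stratum has exactly one observation:
# Sketch: keep only rows whose stratum has at least min_n observations
removeSmallStrata <- function(data, z, min_n = 2) {
  counts <- table(data[[z]])
  keep <- names(counts)[counts >= min_n]
  data[data[[z]] %in% keep, , drop = FALSE]
}

# Usage mirroring the call above; table(x, y, z) can then be rebuilt on the result
# removeSmallStrata(data, z = 'timeinjj')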

Finding the mean of a variable within an imputed data set for population quintiles

I have a data set "base_data" which has missing values. I have therefore used the package 'Amelia' to impute the missing values into an object "a.output".
I have been able to find the mean for some variables within the imputed results using the following code:
q.out <- NULL
se.out <- NULL
for(i in 1:m) {
  dclus <- svydesign(id=~site, data=a.output$base_data[[i]])
  q.out <- rbind(q.out, coef(svymean(~hh_expenditure, dclus)))
  se.out <- rbind(se.out, SE(svymean(~hh_expenditure, dclus)))
}
I have combined the results using:
svymean.combine <- mi.meld(q = q.out, se = se.out)
Which gives me the mean and standard error for household expenditure (hh_expenditure) across the population.
However I have a variable which splits the population into wealth quintiles (wealth_quin).
As such, I am now wanting to find the average, and standard error, of the household expenditure per wealth_quin (a variable which is either 1,2,3,4,or 5).
I initially tried subsetting the imputed data, but this came up with many errors.
Is there a way to do this without having to split up the data into the 5 wealth quintiles before imputing the data?
Cheers,
Timothy
EDIT: HERE IS A WORKABLE EXAMPLE
require(Amelia)
require(survey)
a<-as.data.frame(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16))
b<-as.data.frame(c(1,2,2,1,2,1,1,2,1,2,2,1,1,2,1,2))
c<-as.data.frame(c(2,7,8,5,4,4,3,8,7,9,10,1,3,3,2,8))
d<-as.data.frame(c(3,9,7,4,5,5,2,10,8,10,12,2,4,4,3,7))
e<-as.data.frame(c(2500,8000,NA,4500,4500,NA,2500,NA,7400,9648,1112,1532,3487,3544,NA,7000))
impute<-cbind(a,b,c,d,e)
names(impute) <- c("X","site","var2","var3", "hh_inc")
So now we have a data frame to work with, with missing values for hh_inc, which I want to impute.
First step: set the number of imputations:
m<-5
Now run the imputation:
a.output <- amelia(x = impute, m=m, autopri=0.5, cs="X",
                   idvars=c("site","var2"),
                   logs=c("hh_inc","var3"))
a.output now holds the data from the 5 imputations.
What I now want to do is find the average (and standard error) hh_inc for site 1 and site 2 separately using the imputed values from amelia.
How is that possible to do? I know it is possible to do if I just ignore the NA's. But this might introduce bias, hence why I imputed the values in the first place.
Cheers,
Timothy
EDIT:
I have placed a bounty on this. If no one knows the exact way to do it, then the results from the individual imputed data sets can be combined using Rubin's formula (http://sites.stat.psu.edu/~jls/mifaq.html#minf).
As such, I will award the bounty to someone who can transform the 5 separate imputed datasets from the Amelia object into 5 separate, complete data frames.
require(Amelia)
require(survey)
require(data.table)
require(plotrix)
a<-as.data.frame(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16))
b<-as.data.frame(c(1,2,2,1,2,1,1,2,1,2,2,1,1,2,1,2))
c<-as.data.frame(c(2,7,8,5,4,4,3,8,7,9,10,1,3,3,2,8))
d<-as.data.frame(c(3,9,7,4,5,5,2,10,8,10,12,2,4,4,3,7))
e<-as.data.frame(c(2500,8000,NA,4500,4500,NA,2500,NA,7400,9648,1112,1532,3487,3544,NA,7000))
impute<-cbind(a,b,c,d,e)
names(impute) <- c("X","site","var2","var3", "hh_inc")
summary(impute)
m <- 5
a.output <- amelia(x = impute, m=m, autopri=0.5, cs="X",
                   idvars=c("site","var2"),
                   logs=c("hh_inc","var3"))
stats.out <- NULL
for(i in 1:m){
  df2 <- data.table(a.output$imputations[[i]])
  df3 <- data.frame(dataset=i, df2[,list(std.error(hh_inc),mean(hh_inc)), by="site"])
  stats.out <- rbind(stats.out, df3)
}
colnames(stats.out) <- c("dataset","site","stdError","mean")
stats.out
I'm not sure I understand your question or the structure of your data (specifically the importance of whether the data is imputed or not) but here's how I've done some summary stats by group.
require(data.table)
require(plotrix)
# create some data
df1 <- data.frame(id=seq(1,50,1), wealth = runif(50)*1000)
df1$cutter <- cut(df1$wealth, 5, labels=FALSE)
head(df1)
# put the data into a data.table to speed things up
df2 <- as.data.table(df1)
head(df2)
grp1StdErr <- df2[,std.error(wealth), by="cutter"]
grp1Mean <- df2[,mean(wealth), by="cutter"]
Hope this helps.
Or, in one grouping step:
df2[,list(std.error(wealth),mean(wealth)), by=cut(wealth,5,labels=FALSE)]
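Building on the loop in the question, a minimal sketch of the remaining step: combining the per-site estimates across the 5 imputations with Rubin's rules via Amelia's mi.meld. The reshaping below assumes stats.out as constructed above and is not code from the original posts:
library(Amelia)

# For each site, stack the m means and m standard errors into m x 1 matrices
# (one row per imputation) and combine them with Rubin's rules
combined <- lapply(unique(stats.out$site), function(s) {
  rows <- stats.out[stats.out$site == s, ]
  melded <- mi.meld(q = matrix(rows$mean, ncol = 1),
                    se = matrix(rows$stdError, ncol = 1))
  data.frame(site = s,
             mean = melded$q.mi[1, 1],
             se = melded$se.mi[1, 1])
})
do.call(rbind, combined)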
