missingness mechanism and missing rate - r

I want to apply a missingness mechanism such as MCAR, MAR, or NMAR to my data under a missing rate (5%, 15%) where I have two variables and nine instances:
Aj <- c(48,75,83,58,83,32,45,50,86)
As <- c(24,30,31,35,60,76,81,82,88)
as follows:
To simulate MAR, we first randomly separate the variables into pairs
(Aj, As), 1 ≤ j, s ≤ r, where Aj is the variable into
which missing values are introduced, and As is the variable
that affects the missingness of Aj. Given a pair of
variables (Aj, As) and a missing rate α, we first split the instances
into two equal-sized subsets according to their values
of As. If the variable As is numerical, we find the
median of As and then assign the instances to the
two subsets according to whether their values are greater
than the median of As. For example, we may let the instances
whose values of As are lower than the median 60 (instances
1–5) be missing with probability 4α, that
is to say, Pr(Aj = missing | As ≤ 60) = 4α.
I wrote this code for the missingness mechanism in R:
ifelse(As<=median(As),Aj==NA,Aj)
[1] NA NA NA NA NA 32 45 50 86
My question is: how can I add a missing rate, for example 5%, to this code in R (or to other code) for the above example and illustration?

This one-liner would give you missingness at an expected rate of alpha within the below-median subset:
ifelse( (As <= median(As)) & (runif(length(As)) < alpha), NA, Aj)
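For instance, applied to the example data (a hypothetical run; which positions actually go missing depends on the random draw, so a seed is set purely for reproducibility):
alpha <- 0.05
set.seed(42)   # only so the example is reproducible
ifelse( (As <= median(As)) & (runif(length(As)) < alpha), NA, Aj)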
If you want the observed rate to be close to alpha (you need to figure out how you want to handle rounding), you can do something like this:
Aj <- c(48,75,83,58,83,32,45,50,86)
As <- c(24,30,31,35,60,76,81,82,88)
# missingness rate
alpha <- 0.05
# mark the values at or below the median of As as NA for now
b <- ifelse( As <= median(As) , NA, Aj)
# get the size of that subset (not known beforehand due to tie handling)
n.b <- sum(is.na(b))
b.small <- Aj[is.na(b)]
# sample from that subset at a fixed rate, setting the sampled values to NA
b.small[ sample(1:n.b, size=ceiling(n.b * alpha)) ] <- NA
b[is.na(b)] <- b.small
# b is now Aj with missingness
The output should be similar to
[1] 48 NA 83 58 83 32 45 50 86
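If you want to mirror the question's setup literally, where the below-median half goes missing with probability 4α, a minimal sketch could wrap the logic in a helper (the simulate_mar name and the factor of 4 are just the question's convention, not a standard function):
simulate_mar <- function(Aj, As, alpha) {
  at.risk <- As <= median(As)                    # the half that may go missing
  drop <- at.risk & runif(length(Aj)) < 4 * alpha
  Aj[drop] <- NA
  Aj
}
set.seed(1)                                      # reproducible example only
simulate_mar(Aj, As, alpha = 0.05)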

setting variable value by subsetting

this is my first question, so please bear with me
I am creating a new variable age.f.sex in my dataframe wm.13 using an already existing variable SB1. In the original dataframe, SB1 indicates the age of first sexual intercourse of women interviewed in UNICEF's Multiple Indicators Cluster Surveys. The values that SB1 can take are:
> sort(unique(wm.13$SB1))
[1] 0 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
[26] 30 31 32 33 34 35 36 37 38 39 40 41 42 44 48 95 97 99
Here is the meaning of the values SB1 can take:
0 means she never had sex
97 and 99 mean "does not remember/does not know"
95 means that she had her first sexual intercourse when she started living with her husband/partner (for which there is a specific variable, i.e. MA9)
any other number between 0 and 95 is the declared age at their first sexual intercourse
there are also NAs that sort() does not show, but they appear if I just use unique()
I created a new variable from SB1, which I called age.f.sex.
wm.13$age.f.sex <- wm.13$SB1
I had the 0, 97 and 99 values replaced with NAs, and I kept the original NAs in SB1. I did this using the following code:
wm.13$age.f.sex[wm.13$SB1 == 0] <- NA
wm.13$age.f.sex[wm.13$SB1 == 97] <- NA
wm.13$age.f.sex[wm.13$SB1 == 99] <- NA
wm.13$age.f.sex[is.na(wm.13$SB1)] <- NA
Everything worked fine up to this point. However, I am having trouble with the value 95. I want to write the code so that the observations that have value 95 in SB1 (i.e. the age of first sexual intercourse) will take the value from MA9 (i.e. the age when the woman started living with her partner/husband) in my new variable age.f.sex.
I first started with this code
> wm.13$age.f.sex[wm.13$SB1 == 95] <- wm.13$MA9
but I got the following error message
Error in wm.13$age.f.sex[wm.13$SB1 == 95] <- wm.13$MA9 :
NAs are not allowed in subscripted assignments
After some research on this website, I realised that I might need to subset the right-hand side of the assignment too, but honestly I do not know how to do it. I have a feeling that which() or ifelse() might be of use here, but I cannot figure out their arguments. Examples I have found on this website show how to impute one specific value, but I could not find anything on subsetting according to the value the observations take in another variable.
I hope I have been clear enough. Any suggestion will be much appreciated.
Thanks, Manolo
Perhaps you could try:
wm.13$age.f.sex <- ifelse(wm.13$SB1 %in% c(0,97,99) | is.na(wm.13$SB1), NA, ifelse(wm.13$SB1 == 95, wm.13$MA9, wm.13$SB1))
In short, it works like this: the code checks whether wm.13$SB1 is 0, 97, 99 or missing, and in that case returns NA. Otherwise, it checks whether wm.13$SB1 is 95, and if so, it returns the value on that row in the MA9 column. In all other cases it returns the SB1 value. Because of the wm.13$age.f.sex <- at the beginning of the line, the returned values are assigned to your new age.f.sex variable.
As the error message indicates, it is not possible to do subscripted assignments when the filter contains NAs. A way to circumvent this is to explicitly include NA as a factor level. The following example illustrates a possible way to replace 95s by their corresponding value in a second column.
# example dataframe
df <- data.frame(a = c(NA, 3, 95, NA),
                 b = 1:4)
# set a to factor with NA as one of the levels (besides those in a and b)
df$a <- factor(df$a, levels = union(df$a, df$b), exclude = NULL)
# subscripted assignment (don't forget to filter b too!)
df$a[df$a == 95] <- df$b[df$a == 95]
# restore to numeric
df$a <- as.numeric(as.character(df$a))
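If you would rather keep age.f.sex numeric, another common way to sidestep the error is to build the index with which(), which simply drops NAs from the filter. A minimal sketch using the question's own column names (wm.13, SB1 and MA9 are taken from the question; adjust as needed):
idx <- which(wm.13$SB1 == 95)             # which() ignores NAs in SB1
wm.13$age.f.sex[idx] <- wm.13$MA9[idx]    # remember to subset the right-hand side too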

R sample into two lists

I'm new to R and I want to sample from a list of 97 values. The list is composed of 3 different values (1,2 and 3), each representing a certain condition. The 97 values represent 97 individuals.
Let's assume the list is called original_pop. I want to randomly choose 50 individuals and store them as males, and take the remaining 47 individuals and store them as females. A simple and similar scenario:
original_pop = [1 2 3 3 1 2 2 1 3 1 ...]
male_pop = [50 random values from original_pop]
female_pop = [the 47 values that are not in male_pop]
I created original_pop with sample() so that the values are random, but I don't know how to do the rest. Right now I've stored the first 50 values of original_pop as males and the last 47 as females. That might work because original_pop was randomly generated, but I think it would be more appropriate to choose the values from original_pop randomly rather than in order.
Appreciate your responses!
In the absence of your original_pop data, we simulate it below.
n <- 97
original_pop <- sample(1:3, size=n, replace=TRUE)
maleIndexes <- sample(n, 50)             # 50 distinct random positions
males <- original_pop[maleIndexes]
females <- original_pop[-maleIndexes]    # the remaining 47 individuals
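A quick sanity check (purely illustrative) confirms the sizes of the two groups:
length(males)    # 50
length(females)  # 47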

Looping through data and subsetting data according to upper quantile

I'm currently trying to loop through my data and, by determining whether each value falls above the upper quantile, subset the data so that only the values above the upper quantile are selected. To demonstrate what I mean:
vector <- c()
df <- as.integer(read.table(text = " 88 72 92 38 20 16 8 14 8 4 4 8 6 4 6 2 54 272 2 6"))
for(i in 1:length(df)){
  current.bin <- i
  window.size <- 5
  window <- df[current.bin-window.size : current.bin+window.size]
  upper.quant <- quantile(window, 0.95)
  if(df[current.bin] > upper.quant){
    vector[i] <- current.bin
  }
}
str(vector)
int [1:18] NA NA 3 4 NA NA NA NA NA NA ...
So as I loop through, I want it to look at the values before and after (a window of 5) and use those to determine the upper quantile before deciding whether the value it is currently looking at falls above it or not. After that I want to use the current.bin value to subselect data from another dataframe, by using it to specify the rows I want to extract. However, when I look at the vector that is produced, its length is 2 less than the number of values in my df. I can't figure out why this is happening, any ideas?
Also, how would I go about using the positions of values that are above the upper quantile to subselect data? Using the df as an example, I want it to go something along the lines of:
df <- df["row positions using values in vector", ]

How to perform dist and kNN in R for genomic data?

I have genomic data with missing values and I want to calculate the distance between the expression levels of each pair of genes using the available values. Then I want to find the k nearest neighbours to fill the gaps. How can I do that in R?
gene  sample 1  sample 2  sample 3  sample 4
1     5555      NA        2151      5484
2     5564      NA        NA        NA
3     4544      4656      14546     45455
4     NA        54654     NA        NA
...
How can I calculate the Euclidean distance? Do I need to use just one row at a time?
Sorry, I'm new to genomic data and I can't find this information anywhere.
Thanks.
I guess what you are trying to do is kNN imputation for the missing values, not kNN classification. There is a ready-made function for this called impute.knn from the impute package on Bioconductor. Read the help file closely before use.
source("http://bioconductor.org/biocLite.R")
biocLite("impute")
require(impute)
x <- rnorm(1000, 50, 5) # 1000 random samples
x[sample(1:1000, 50)] <- NA # 50 are randomly made NA
x <- matrix(x, nrow = 10) # make a matrix
impute.knn(x)
Googling for "R k nearest neighbor" leads me to the knn function in the class package. In regard to your second question, calculating the Euclidean distance is simply:
sqrt((sample1_x - sample1_y)^2 + ... + (sample4_x - sample4_y)^2)
where x and y are the indices of the rows you want to calculate the distance between. However, you have a lot of NAs in your data; I'm not sure how you want to deal with that, as the Euclidean distance is undefined when NAs are involved.
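As a side note on the NAs: base R's dist() excludes coordinates where either value is missing and rescales the remaining sum of squares, which may be enough for a quick pairwise gene distance. A minimal sketch using the four genes from the question's table:
# rows = genes, columns = samples, values copied from the question's table
expr <- rbind(c(5555, NA,    2151,  5484),
              c(5564, NA,    NA,    NA),
              c(4544, 4656,  14546, 45455),
              c(NA,   54654, NA,    NA))
# Euclidean distances between every pair of rows; pairs with no overlapping
# non-missing samples (e.g. genes 2 and 4 here) come out as NA
dist(expr, method = "euclidean")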

How to remove an observation from a column that falls outside a desired range without leaving an NA

I am looking to remove multiple observations from one column within a dataframe based on their value without affecting the rest of the row.
df1=data.frame(c("male","female","male"),seq(1,30),seq(11,40))
names(df1) = c("col_a","col_b","col_c")
For example, removing the values from col_b that are below 5 or above 20 without affecting col_a or col_c. I am then looking to use this data for descriptive analysis and summaries.
Currently I am using this code to do the job:
df1$col_b[df1$col_b<5|df1$col_b>20] <- ""
df1$col_b<-as.numeric(df1$col_b)
However, this creates NA values which get in the way of the analysis. Is there a way of doing this that does not create NA values, or a quick way of removing them without affecting the row?
A numeric column can have normal values, NA, Inf, -Inf and NaN. But "empty" is not a possible value.
The reason for having NA is to mark that the value isn't available - which seems to be exactly what you want! Using a sentinel such as a negative number is just a more awkward way of doing the same thing - you'd have to remove all negative numbers before calculating the mean, sum, etc. You can do the same thing with NA, and that functionality is typically built into the functions: just specify na.rm=TRUE.
df1 <- data.frame(col_a=c("male","female","male"),col_b=seq(1,30),col_c=seq(11,40))
df1$col_b[df1$col_b<5|df1$col_b>20] <- NA
sum(df1$col_b, na.rm=TRUE) # 200
median(df1$col_b, na.rm=TRUE) # 12.5
Maybe what you really need is mean(..., na.rm = TRUE). See ?mean, let the existence of NA help you.
Use subset:
> df2 <- subset(df1, ! ( df1$col_b<5|df1$col_b>20) )
> df2$col_b <- as.numeric(df2$col_b)
> df2
col_a col_b col_c
5 female 5 15
6 male 6 16
7 male 7 17
8 female 8 18
9 male 9 19
10 male 10 20
11 female 11 21
12 male 12 22
13 male 13 23
14 female 14 24
15 male 15 25
16 male 16 26
17 female 17 27
18 male 18 28
19 male 19 29
20 female 20 30
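With the subset approach the NAs never appear, so plain summaries work directly on df2, for example:
mean(df2$col_b)    # 12.5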
I'm taking it your ultimate intent is "How to ignore outliers in a column for subsequent analysis?"
You didn't say where the magic 5,20 range came from, nor what sort of analysis (mean/median/stdev, or something more complicated?).
You said: "aiming to use the column within the original dataframe for the analysis without subsetting as the purpose of this process is to remove outliers both visually and for calculations of averages."
If the magic 5,20 values came from a quantile (e.g. the 5th-95th quantile range, the "middle 90%"), you can compute arbitrary quantile values automatically with quantile(df1$col_b, c(0.05,0.95)). If you e.g. also want to see the median, pass the vector quantile(..., c(0.05,0.5,0.95)).
Whereas if 5,20 is a known range, use the approach the others have shown you with logical indexing or subsetting to assign the outliers to NA. NA is your friend for analysis; it propagates into all calculations just like you'd want. NA is also your friend for plotting. Learn to love NA. Keep a copy of the original df1 (or just the original df1$col_b) if you need to access the outlier values later.
If you want to experiment with distributions to see which one your data follows, see Ch 8 "Probability distributions" of http://cran.r-project.org/doc/manuals/R-intro.pdf
Here it all is in code:
# inrange <- function(x, a, b) { x >= a & x <= b }
inrange_else_NA <- function(x, minmax) { ifelse(x >= minmax[1] & x <= minmax[2], x, NA) }
# If you want to save the original col_b and modify it in place...
# df1$col_b.orig <- df1$col_b
# To exclude outliers outside a known range...
df1$col_b_NAs <- inrange_else_NA(df1$col_b, c(5, 20))
# ... or else to exclude outliers outside (say) the middle 90% quantile range
middle_90th_quantile <- as.vector(quantile(df1$col_b, c(0.05, 0.95)))
df1$col_b_NAs <- inrange_else_NA(df1$col_b, middle_90th_quantile)
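Averages then ignore the excluded values simply by passing na.rm=TRUE:
mean(df1$col_b_NAs, na.rm = TRUE)      # mean of col_b with the outliers ignored
median(df1$col_b_NAs, na.rm = TRUE)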
