Pull names of variables with a threshold of missing values in R

I am working with a data set containing 93 columns, many of which have a large percentage of missing values. I am looking for a way to streamline screening each column for its percentage of missing values and then return the names of the variables that meet a missingness threshold, to include in a new data set.
I have a function to check for missing values and return the percentage of missing:
# check for missing data: percent of values in x that are NA
pMiss <- function(x) {
  sum(is.na(x)) / length(x) * 100
}
# percent of data missing per column
x <- apply(dt2, 2, pMiss)
How can I retrieve from x the names of all columns where the percentage of missing values is less than 20%? I would like these names as a list that I can use to build a new data set, so I don't have to manually copy and paste each name from x.
Thank you in advance.

This'll work:
# example dataset
set.seed(123)
dat <- data.frame(a = sample(c(1, 2, NA), size = 20, replace = TRUE),
                  b = sample(c(1, 2, NA), size = 20, replace = TRUE),
                  c = sample(c(1:10, NA), size = 20, replace = TRUE))
threshold <- .25  # for example
# get subset of colnames s.t. NA proportion is greater than threshold
names(dat)[sapply(dat, function(x) mean(is.na(x)) > threshold)]
## [1] "a" "b"
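Note that the question asks for the columns below 20% missing; flipping the comparison returns those directly:
threshold <- .20
names(dat)[sapply(dat, function(x) mean(is.na(x)) < threshold)]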

You can use a tidyverse approach:
library(tidyverse)
set.seed(123)
dat <- data.frame(a = sample(c(1, 2, NA), size = 20, replace = TRUE),
                  b = sample(c(1, 2, NA), size = 20, replace = TRUE),
                  c = sample(c(1:10, NA), size = 20, replace = TRUE))
threshold <- .43
dat %>%
  gather(var, value) %>%
  group_by(var) %>%
  summarise(prop.missing = sum(is.na(value)) / n()) %>%
  filter(prop.missing < threshold)
  var   prop.missing
  <chr>        <dbl>
1 a            0.400
2 c            0.
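To get just the column names as a character vector (what the question ultimately asks for), append dplyr::pull to the same pipeline:
dat %>%
  gather(var, value) %>%
  group_by(var) %>%
  summarise(prop.missing = sum(is.na(value)) / n()) %>%
  filter(prop.missing < threshold) %>%
  pull(var)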

df <- data.frame(a=c(NA,NA,1,1),b=c(NA,1,1,1),c=c(1,1,1,1))
x <- colMeans(is.na(df))
# a b c
# 0.50 0.25 0.00
x[x < .3]
# b c
# 0.25 0.00
names(x[x < .3])
# [1] "b" "c"
Or, all in one line:
names(df)[colMeans(is.na(df)) < .3]
# [1] "b" "c"
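To build the reduced data set directly from this, subset the columns in one step (drop = FALSE keeps a data frame even if only one column survives):
df_new <- df[, colMeans(is.na(df)) < .3, drop = FALSE]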

Related

Retrieve row names of maximum and second maximum values of a column in R

We have this df
# We create the df
x <- c(1,33,5,2,56,1)
y <- c(4,358,57,3,32,2)
df <- as.data.frame(cbind(x,y))
rownames(df) <- c("a", "b", "c", "d", "e", "f")
The df is:
x y
a 1 4
b 33 358
c 5 57
d 2 3
e 56 32
f 1 2
I would like to retrieve the row names of the maximum value and the second-highest value from column x, and the same for column y.
So the result would be e and b for column x, and b and c for column y.
I tried this code, but unsuccessfully.
rownames(df)[max(df$x)] # for the maximum value
nx <- length(df$x) # length of the x column
rownames(df)[sort(df$x, partial=nx-1)[nx-1]] # for the second max value
However, the results of the previous three code lines are:
NA # what's wrong?
6 # yeah, it is 6
"e" # nope, the second max is "b"
Where is the problem and how can I solve these issues?
We can loop over the columns, order each in decreasing order, use that index to get the row names, and keep the first two:
sapply(df, function(x) head(row.names(df)[order(x, decreasing = TRUE)], 2))
# x y
#[1,] "e" "b"
#[2,] "b" "c"
You were close: just locate the values
rownames(df[df$x == max(df$x),]) # for the maximum value
nx <- length(df$x) # length of the x column
rownames(df[df$x == sort(df$x, partial=nx-1)[nx-1],]) # for the second max value
An alternative using dplyr::filter with the base R quantile function (here variable stands for the column of interest):
df %>%
  filter(variable > quantile(.$variable, 0.975, na.rm = TRUE))
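Applied to column x of the example df in base R (a sketch, assuming no ties; note that dplyr::filter drops row names):
rownames(df)[df$x >= sort(df$x, decreasing = TRUE)[2]]
# [1] "b" "e"   (the rows holding the two largest x values)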

Transform a dataframe of frequencies to a wider format

I have a dataframe that looks like this.
input dataframe
position,mean_freq,reference,alternative,sample_id
1,0.002,A,C,name1
2,0.04,G,T,name1
3,0.03,A,C,name2
These data are nucleotide differences at a given position in a hypothetical genome; mean_freq is relative to the reference, so the first row means the proportion of C's is 0.002, implying A's are at 0.998.
I want to transform this to a different structure by creating new columns such that,
desired_output
position,G,C,T,A,sample_id
1,0,0.002,0,0.998,name1
2,0.96,0,0.04,0,name1
3,0,0.03,0,0.97,name2
I have attempted this approach
per_position_full_nt_freq <- function(x){
  df <- data.frame(A=0, C=0, G=0, T=0)
  idx <- names(df) %in% x$alternative
  df[,idx] <- x$mean_freq
  idx2 <- names(df) %in% x$reference
  df[,idx2] <- 1 - x$mean_freq
  df$position <- x$position
  df$sampleName <- x$sampleName
  return(df)
}
desired_output_dataframe <- per_position_full_nt_freq(input_dataframe)
I ran into an error
In matrix(value, n, p) :
data length [8905] is not a sub-multiple or multiple of the number of columns
Additionally, I feel there has to be a more intuitive solution, presumably using tidyr or dplyr.
How do I conveniently transform the input dataframe to the desired output dataframe format?
Thank you.
One option would be to create a matrix of 0's with the 'G', 'C', 'T', 'A' column names, match the 'alternative' and 'reference' values to those column names, use the resulting row/column index to assign the values, and then cbind with the original dataset's 'position' and 'sample_id' columns:
m1 <- matrix(0, ncol = 4, nrow = nrow(df1), dimnames = list(NULL, c("G", "C", "T", "A")))
m1[cbind(seq_len(nrow(df1)), match(df1$alternative, colnames(m1)))] <- df1$mean_freq
m1[cbind(seq_len(nrow(df1)), match(df1$reference, colnames(m1)))] <- 1 - df1$mean_freq
cbind(df1['position'], m1, df1['sample_id'])
#  position    G     C    T     A sample_id
#1        1 0.00 0.002 0.00 0.998     name1
#2        2 0.96 0.000 0.04 0.000     name1
#3        3 0.00 0.030 0.00 0.970     name2
The following should do the trick:
library(readr)
library(dplyr)
library(tidyr)
input_df <- read_csv(
'position,mean_freq,reference,alternative,sample_id
1,0.002,A,C,name1
2,0.04,G,T,name1
3,0.03,A,C,name2'
)
input_df %>%
  mutate(ref_val = 1 - mean_freq) %>%
  spread(alternative, mean_freq, fill = 0) %>%
  spread(reference, ref_val, fill = 0) %>%
  select(position, G, C, T, A, sample_id)
One assumption here is that the alternative and reference are always distinct; otherwise you will get two columns with the same name but different values. You would need to handle that with a couple of commands at the beginning of the code if need be.
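On current tidyr (1.0 or later), where spread is superseded, the same reshape can be sketched with pivot_wider (a sketch under the same distinctness assumption; values_fill = 0 plays the role of fill = 0):
input_df %>%
  mutate(ref_val = 1 - mean_freq) %>%
  pivot_wider(names_from = alternative, values_from = mean_freq, values_fill = 0) %>%
  pivot_wider(names_from = reference, values_from = ref_val, values_fill = 0) %>%
  select(position, G, C, T, A, sample_id)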

R: Concatenated values in column B based on values in column A

QUESTION: Using R, how would you create values in column B prefixed with a constant "1" + n 0's where n is the value in each row in column A?
# R CODE EXAMPLE
df <- as.data.frame(1:3); colnames(df)[1] <- "A"
print(df)
#   A
# 1 1
# 2 2
# 3 3
preFixedValue <- 1; repeatedValue <- 0;
#pseudo code: create values in column B with n 0's prefixed with 1
df <- cbind(df,paste(rep(c(preFixedValue,repeatedValue), times = c(1,df[1:nrow(df),])),collapse = ""));
#expected/desired result
# A B
# 1 10
# 2 100
# 3 1000
USE CASE: Real data contains hundreds of rows in column A with random integers, not just three sequential int's as shown in the code above.
Below is an example using Excel to demonstrate what I want to do in R.
The rowwise() function in dplyr lets you make variables from column values in each row.
library(dplyr)
df <- data.frame(A = 1:3, B = NA)
preFixedValue <- 1; repeatedValue <- 0
df <- df %>%
  rowwise() %>%
  mutate(B = as.numeric(paste0(c(preFixedValue, rep(repeatedValue, A)), collapse = "")))
For maximum flexibility, i.e. total freedom in choosing the prefixed and repeated values as single values or vectors, and for simplicity of syntax (one single line); note the width is df$A + 1 to account for the prefix character itself:
library(stringr)
df$B <- str_pad(preFixedValue, width = df$A + 1, pad = repeatedValue, side = "right")
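Note that str_pad returns character values; if B needs to be numeric, wrap the call in as.numeric:
df$B <- as.numeric(str_pad(preFixedValue, width = df$A + 1, pad = repeatedValue, side = "right"))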
Would something like this work?
B <- 10^df$A
df <- cbind(df, B)
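Another vectorized base R option builds the strings directly with strrep (available since R 3.3.0); a minimal sketch:
df$B <- as.numeric(paste0(preFixedValue, strrep(repeatedValue, df$A)))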

R Minimum Value from Datatable Not Equal to a Particular Value

How do I find the minimum value from an R data table other than a particular value?
For example, there could be zeroes in the data table and the goal would be to find the minimum non zero value.
I tried using the sapply with min, but am not sure how to specify the extra criteria that we have so that the minimum is not equal to a certain value.
More generally, how do we find the minimum value in a data table that is not equal to any element of a list of excluded values?
If you want to find the minimum value from a vector while excluding certain values from that vector, then you can use %in%:
v <- c(1:10) # values 1 .. 10
v.exclude <- c(1, 2) # exclude the values 1 and 2 from consideration
min.exclude <- min(v[!v %in% v.exclude])
The logic won't change much if you are using a column from a data table/frame; just replace the vector v with the appropriate column. If your excluded values are held in a list, flatten it to produce the v.exclude vector.
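For instance, a minimal sketch of that flattening step, assuming the exclusions arrive as a plain list:
exclude_list <- list(1, 2)         # excluded values held in a list
v.exclude <- unlist(exclude_list)  # flatten to an atomic vector
min(v[!v %in% v.exclude])
# [1] 3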
This can be done with data.table (the OP mentioned a data table in the post) after setting the key:
library(data.table)
setDT(df, key='a')[!.(exclude)]
# a b
#1: 4 40
#2: 5 50
#3: 6 60
If we need the min value of 'a'
min(setDT(df, key='a')[!.(exclude)]$a)
#[1] 4
For finding the min in all the columns (using the setkey method), we loop over the columns of the dataset, set the key to each column in turn, subset the dataset, and collect the min value in a previously created list object:
setDT(df)
MinVal <- vector('list', length(df))
for(j in seq_along(df)){
  setkeyv(df, names(df)[j])
  MinVal[[j]] <- min(df[!.(exclude)][[j]])
}
MinVal
#[[1]]
#[1] 4
#[[2]]
#[1] 10
data
df <- data.frame(a = c(0,2,3,2,1,2,3,4,5,6),
                 b = c(10,10,20,20,30,30,40,40,50,60))
exclude <- c(0,1,2,3)
Assuming you are working with a data.frame
Data
df <- data.frame(a = c(0,2,3,2,1,2,3,4,5,6),
                 b = c(10,10,20,20,30,30,40,40,50,60))
Values to exclude from our minimum search
exclude <- c(0,1,2,3)
We can find the minimum value of column a, excluding our exclude vector:
## minimum from column a
min(df[!df$a %in% exclude,]$a)
# [1] 4
Or from b
exclude <- c(10, 20, 30, 40)
min(df[!df$b %in% exclude,]$b)
# [1] 50
To return the row that corresponds to the minimum value
df[df$b == min( df[ !df$b %in% exclude, ]$b ),]
# a b
# 9 5 50
Update
To find the minimum row across multiple columns, we can do it this way:
## values to exclude
exclude_a <- c(0,1)
exclude_b <- c(10)
## exclude rows/values from each column we don't want
df2 <- df[!(df$a %in% exclude_a) & !(df$b %in% exclude_b),]
## order the data
df3 <- df2[with(df2, order(a,b)),]
## take the first row
df3[1,]
# > df3[1,]
# a b
#4 2 20
Update 2
To select from multiple columns we can iterate over them as @akrun has shown, or alternatively we can construct our subsetting expression from the column names and evaluate it inside our [ operation:
exclude <- c(0,1,2, 10)
## construct a formula/expression using the column names
n <- names(df)
expr <- paste0("(", paste0(" !(df$", n, " %in% exclude) ", collapse = "&") ,")")
# [1] "( !(df$a %in% exclude) & !(df$b %in% exclude) )"
expr <- parse(text=expr)
df2 <- df[eval(expr),]
## order and select first row as before
df2 <- df2[with(df2, order(a,b)),]
df2 <- df2[1,]
And if we wanted to use data.table for this:
library(data.table)
setDT(df)[ eval(expr) ][order(a, b),][1,]
Comparison of methods:
library(microbenchmark)
fun_1 <- function(x){
  df2 <- x[eval(expr),]
  ## order and select first row as before
  df2 <- df2[with(df2, order(a,b)),]
  df2 <- df2[1,]
  return(df2)
}
fun_2 <- function(x){
  df2 <- setDT(x)[ eval(expr) ][order(a, b),][1,]
  return(df2)
}
## including @akrun's solution
fun_3 <- function(x){
  setDT(df)
  MinVal <- vector('list', length(df))
  for(j in seq_along(df)){
    setkeyv(df, names(df)[j])
    MinVal[[j]] <- min(df[!.(exclude)][[j]])
  }
  return(MinVal)
}
microbenchmark(fun_1(df), fun_2(df), fun_3(df) , times=1000)
# Unit: microseconds
# expr min lq mean median uq max neval
# fun_1(df) 770.376 804.5715 866.3499 833.071 869.2195 2728.740 1000
# fun_2(df) 854.862 893.1220 952.1207 925.200 962.6820 3115.119 1000
# fun_3(df) 1108.316 1148.3340 1233.1268 1186.938 1234.3570 5400.544 1000

How do I add random `NA`s into a data frame

I created a data frame with random values
n <- 50
df <- data.frame(id = seq(1:n),
                 age = sample(c(20:90), n, rep = TRUE),
                 sex = sample(c("m", "f"), n, rep = TRUE, prob = c(0.55, 0.45)))
and would like to introduce a few NA values to simulate real world data. I am trying to use apply but cannot get there. The line
apply(subset(df,select=-id), 2, function(x) {x[sample(c(1:n),floor(n/10))]})
will retrieve random values alright, but
apply(subset(df,select=-id), 2, function(x) {x[sample(c(1:n),floor(n/10))]<-NA})
will not set them to NA. Have tried with and within, too.
Brute force works:
for (i in (1:floor(n/10))) {
  df[sample(c(1:n), 1), sample(c(2:ncol(df)), 1)] <- NA
}
But I'd prefer to use the apply family.
Return x within your function:
> df <- apply (df, 2, function(x) {x[sample( c(1:n), floor(n/10))] <- NA; x} )
> tail(df)
id age sex
[45,] "45" "41" NA
[46,] "46" NA "f"
[47,] "47" "38" "f"
[48,] "48" "32" "f"
[49,] "49" "53" NA
[50,] "50" "74" "f"
Apply returns an array, thereby converting all columns to the same type. You could use this instead:
df[,-1] <- do.call(cbind.data.frame,
                   lapply(df[,-1], function(x) {
                     x[sample(c(1:n), floor(n/10))] <- NA
                     x
                   }))
Or use a for loop:
for (i in seq_along(df[,-1]) + 1) {
  is.na(df[sample(seq_len(n), floor(n/10)), i]) <- TRUE
}
Using dplyr[1] you could arrive at the desired solution with the following compact syntax:
set.seed(123)
library("tidyverse")
n <- 50
df <- data.frame(
  id = seq(1:n),
  age = sample(c(20:90), n, replace = TRUE),
  sex = sample(c("m", "f"), n, replace = TRUE, prob = c(0.55, 0.45))
)
mutate(.data = as_tibble(df),
       across(
         .cols = all_of(c("age", "sex")),
         .fns = ~ ifelse(row_number(.x) %in% sample(1:n(), size = (10 * n() / 100)), NA, .x)
       ))
Results
Approximately 10% of the values are replaced with NA per column; this follows from sample(1:n(), size = (10 * n() / 100)).
count(.Last.value, sex)
# A tibble: 3 x 2
# sex n
# <chr> <int>
# 1 f 21
# 2 m 24
# 3 NA 5
# A tibble: 50 x 3
# id age sex
# <int> <int> <chr>
# 1 1 50 m
# 2 2 70 m
[1] I'm loading tidyverse as replace_na is available via tidyr.
I think you need to return the x value from the function:
apply(subset(df, select = -id), 2, function(x) {
  x[sample(c(1:n), floor(n/10))] <- NA; x
})
but you also need to assign this back to the relevant subset of the data frame (and subset(...) <- ... doesn't work):
idCol <- names(df) == "id"
df[,!idCol] <- apply(df[,!idCol], 2, function(x) {
  x[sample(1:n, floor(n/10))] <- NA; x
})
(if you have only a single non-ID column you'll need df[,!idCol,drop=FALSE])
Here is another simple way to go at it.
Your data frame:
df <- mtcars
Number of missing values required:
nbr_missing <- 20
Sample row and column indices:
y <- data.frame(row = sample(nrow(df), size = nbr_missing, replace = TRUE),
                col = sample(ncol(df), size = nbr_missing, replace = TRUE))
Remove duplicates:
y <- y[!duplicated(y),]
Use matrix indexing:
df[as.matrix(y)] <- NA
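To verify how many NAs were actually introduced (fewer than nbr_missing whenever duplicated index pairs were dropped):
sum(is.na(df))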
To introduce a certain percentage of NAs into your dataframe you could use this (percentage defined here for completeness):
percentage <- 10
while (sum(is.na(df)) < (nrow(df) * ncol(df) * percentage / 100)) {
  df[sample(nrow(df), 1), sample(ncol(df), 1)] <- NA
}
You could also replace (nrow(df) * ncol(df) * percentage / 100) with a fixed number of NAs.
You can also use prodNA from the missForest package.
library(missForest)
library(dplyr)
bind_cols(df[1], missForest::prodNA(df[-1], noNA = 0.1))
# A tibble: 50 x 3
id age sex
<int> <int> <fct>
1 1 NA m
2 2 84 NA
3 3 82 f
4 4 42 f
5 5 35 m
6 6 80 m
7 7 90 f
8 8 NA NA
9 9 89 f
10 10 42 m
# … with 40 more rows
Simply pass your dataframe into the following function. The only arguments are the frame you want to add NAs to and the number of features (columns) you want to have with NAs.
add_random_nas_to_frame <- function(frame, num_features) {
  col_order <- names(frame)
  rand_cols <- sample(ncol(frame), num_features)
  left_overs <- which(!names(frame) %in% names(frame[, rand_cols]))
  other_frame <- frame[, left_overs]
  nas_added <- data.frame(lapply(frame[, rand_cols], function(x) {
    x[sample(c(TRUE, NA), prob = c(sample(100, 1)/100, 0.15), size = length(x), replace = TRUE)]
  }))
  final_frame <- cbind(other_frame, nas_added)
  final_frame <- final_frame[, col_order]
  return(final_frame)
}
For example, using the full bank marketing dataset from UCI:
https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
bank <- read.table(file = 'path_to_data', sep = ";", stringsAsFactors = FALSE, header = TRUE)
Viewing the original data, we can see there is no missing data in the frame.
Now applying our function:
bank_nas <- add_random_nas_to_frame(bank, 5)
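A quick way to confirm the result is to count the NAs introduced per column:
colSums(is.na(bank_nas))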
