Data cleaning: Function to find very similar variables - r

I have some large data, part of which consists of very similar variables. Some variables have missing values (e.g. x3 and x5 in the example below) and some are similar but differently labeled (e.g. x2 and x5). To clean my data, I want to identify and eventually delete these similar variables. I am trying to write a function that returns the column names of all similar variable pairs. Here is some example data:
# Example data
set.seed(222)
N <- 100
x1 <- round(rnorm(N, 0, 10))
x2 <- round(rnorm(N, 10, 20))
x3 <- x1
x3[sample(1:N, 7)] <- NA
x4 <- x1
x4[sample(1:N, 5)] <- round(rnorm(5, 0, 10))
x5 <- x2
x5 <- paste("A", x5, sep = "")
x5[sample(1:N, 15)] <- NA
df <- data.frame(x1, x2, x3, x4, x5)
df$x1 <- as.character(df$x1)
df$x2 <- as.character(df$x2)
df$x3 <- as.character(df$x3)
df$x4 <- as.character(df$x4)
df$x5 <- as.character(df$x5)
head(df)
As you can see, x1, x3, and x4 are very similar; and x2 and x5 are very similar as well. My function should return a list containing all pairs with the same values in 80% or more of the cases. Here is what I got so far:
# My attempt to write such a function
fun_clean <- function(data, similarity) {
  output <- list()
  data <- data[complete.cases(data), ]
  for(i in 1:ncol(data)) {
    if(i < ncol(data)) {
      for(j in (i + 1):ncol(data)) {
        similarity_ij <- sum(data[ , i] == data[ , j]) / nrow(data)
        if(similarity_ij >= similarity) {
          output[[length(output) + 1]] <- colnames(data)[c(i, j)]
        }
      }
    }
  }
  output
}
fun_clean(data = df, similarity = 0.8)
I managed to identify the similarity of x1, x3, and x4. The similarity of x2 and x5 (i.e. similar variables with different labels) is not found by my function. Further, my function is very slow. Therefore, I have the following question:
Question: How could I identify all similar variables in a computationally efficient way?

In order to compare your columns, you need comparable numeric values first. You can strip everything that is not part of a number using gsub() and then convert to numeric. (The pattern below also keeps the minus sign; with "[^0-9]" alone, negative values would lose their sign.) After this transformation, you'll be good to go:
df <- apply(df, 2, function(x) as.numeric(gsub("[^0-9-]", "", x)))
Now you can compare all columns by first using combn(ncol(df), 2) to get all pairs of columns you want to compare. Then you can use that to compare the columns and calculate the percentage of equal entries.
combs <- combn(ncol(df), 2)
res <- apply(combs, 2, function(x) {
  sum(df[, x[1]] == df[, x[2]], na.rm = TRUE) / nrow(df)
})
thresh <- 0.8
combs[, res > thresh]
# [,1] [,2] [,3] [,4]
# [1,] 1 1 2 3
# [2,] 3 4 5 4
So columns 1 & 3, 1 & 4, 2 & 5 and 3 & 4 are equal to each other in more than 80% of the cases.
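Since the question asks for column names rather than indices, a small follow-up step (assuming the combs, res, and thresh objects above; apply() keeps the column names of df) maps each qualifying pair back to names:
pairs <- combs[, res > thresh, drop = FALSE]
apply(pairs, 2, function(i) colnames(df)[i])
#      [,1] [,2] [,3] [,4]
# [1,] "x1" "x1" "x2" "x3"
# [2,] "x3" "x4" "x5" "x4"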
Note: if one or both of the compared values is NA, the pair is counted as a non-match!
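If you would rather ignore NAs than count them as mismatches (closer in spirit to the original fun_clean(), which dropped incomplete rows), a possible variant (a sketch, assuming the combs object from above) divides by the number of pairwise-complete rows instead:
res_na <- apply(combs, 2, function(x) {
  ok <- complete.cases(df[, x])  # rows where both columns are non-NA
  sum(df[ok, x[1]] == df[ok, x[2]]) / sum(ok)
})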

The caret package has functionality to discover correlated variables, and also variables that are linear combinations of others:
http://topepo.github.io/caret/pre-processing.html
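For example, a minimal sketch on the numeric version of df from the first answer (findCorrelation() and findLinearCombos() are the relevant caret functions):
library(caret)
num <- apply(df, 2, function(x) as.numeric(gsub("[^0-9-]", "", x)))
num <- num[complete.cases(num), ]
# Indices of columns so highly correlated that one of each pair could be dropped
findCorrelation(cor(num), cutoff = 0.9)
# Columns that are exact linear combinations of other columns
findLinearCombos(num)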

Related

R: calculating mean of positive values in every column of a data frame

I want to calculate the mean of every column of a data frame, but only the positive values should be considered. The positive means of all columns should be collected in one vector.
My code:
x <- data.frame(replicate(3, sample(-5000:7000, 1000, rep = TRUE)))
meanxpositive <- c(NA)
for (n_col in 1:3) {
  z <- mean(x[which(x[, ncol] > 0)])
  meanxpositive[n_col] <- z
}
This code doesn't work. Maybe someone has a better idea.
sapply(x, function (y) mean(y[y > 0]))
colSums(as.matrix(x) * (x > 0)) / colSums(x > 0)  # divide by the count of positives, not nrow(x)
A colMeans approach might look like:
x[x <= 0] <- NA
colMeans(x, na.rm = TRUE)
# X1 X2 X3
#3483.664 3626.115 3533.687
Since you also mentioned you wanted to see this solution using a for loop you could adjust your old code to:
meanxpositive <- rep(NA, ncol(x))
for (n_col in 1:3) {
  z <- mean(x[which(x[, n_col] > 0), n_col])  # Changed this line to reference "n_col" instead of "ncol"
  meanxpositive[n_col] <- z
}
meanxpositive
#[1] 3483.664 3626.115 3533.687
Note that I changed references to ncol to n_col and also put a specific column selection of n_col in your mean(...); before, you weren't selecting any columns.
Lastly, with for loops it is best to pre-allocate the memory of your result. In this case that means creating it at the size you know it should have (3). Growing objects inside loops is extremely slow and inefficient in R.
Data:
set.seed(1)
x <- data.frame(replicate(3, sample(-5000:7000, 1000, rep = TRUE)))
If you want to keep your data in a data.frame,
library(dplyr)
set.seed(47)
x <- data.frame(replicate(3, sample(-5000:7000, 1000, replace = TRUE)))
x %>% summarise_all(~mean(.x[.x > 0]))
#> X1 X2 X3
#> 1 3578.912 3535.614 3358.444
or with the old funs notation,
x %>% summarise_all(funs(mean(.[. > 0])))
#> X1 X2 X3
#> 1 3578.912 3535.614 3358.444
or in base R,
aggregate(. ~ TRUE, x, function(x){mean(x[x > 0])})
#> X1 X2 X3
#> 1 3578.912 3535.614 3358.444
or data.table,
library(data.table)
setDT(x)[, lapply(.SD, function(x){mean(x[x > 0])})]
#> X1 X2 X3
#> 1: 3578.912 3535.614 3358.444

Find the closest average of three numbers - code optimization

This may seem trivial, but I have code that finds the average of the closest two numbers in a set of three. So, 5 examples:
x1 <- c(1,5,7)
x2 <- c(NA,2,3)
x3 <- c(2,6,4)
x4 <- c(1,NA,NA)
x5 <- c(1,3,1)
I would get an output of
y1 = 6
y2 = 2.5
y3 = 4
y4 = 1
y5 = 1
respectively. Basically, finding the closest 2 numbers, then averaging them, accounting for NA and ties.
This code is a monstrosity:
x <- x[!is.na(x)]
x <- x[order(x)]
y <- ifelse(length(x) == 1, x,
       ifelse(length(x) == 2, mean(x),
         ifelse(length(x) == 3,
           ifelse(abs(x[1] - x[2]) == abs(x[2] - x[3]), mean(x),
             ifelse(abs(x[1] - x[2]) > abs(x[2] - x[3]), mean(x[2:3]),
               ifelse(abs(x[1] - x[2]) < abs(x[2] - x[3]), mean(x[1:2]),
                 "error"))), NA)))
It works, but because this is part of a larger for loop, I was wondering if there's a better way of doing this.
We define an S3 generic with "list" and "default" methods.
The "default" method takes a vector and sorts it (which also removes NA values). If at most one value is left, it returns that single value, or NA if none. If the length is 2, or the two successive differences are equal, it returns the mean of all values; otherwise, it finds the index of the first of the two closest values and returns the mean of that pair.
The "list" method applies the default method to each element of the list.
mean_min_diff <- function(x) UseMethod("mean_min_diff")
mean_min_diff.list <- function(x) sapply(x, mean_min_diff.default)
mean_min_diff.default <- function(x) {
  x0 <- sort(x)
  if (length(x0) <= 1) c(x0, NA)[1]
  else if (length(x0) == 2 || sd(diff(x0)) == 0) mean(x0)
  else mean(x0[seq(which.min(diff(x0)), length = 2)])
}
Now test it out:
mean_min_diff(x1)
## [1] 6
mean_min_diff(list(x1, x2, x3, x4, x5))
## [1] 6.0 2.5 4.0 1.0 1.0
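Since the question mentions this runs inside a larger for loop, here is one hypothetical way to apply it row-wise when the triples are stacked in a matrix:
m <- rbind(x1, x2, x3, x4, x5)  # one triple per row
apply(m, 1, mean_min_diff)
##  x1  x2  x3  x4  x5
## 6.0 2.5 4.0 1.0 1.0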

Split a vector into multiple vectors in R

I want to split one vector (x) into multiple vectors (x1, x2, ..., xn).
My input: x <- 1:10
My desired output:
x1 <- c(1,2,3,4)
x2 <- c(2,3,4,5)
x3 <- c(3,4,5,6)
x4 <- c(4,5,6,7)
x5 <- c(5,6,7,8)
x6 <- c(6,7,8,9)
x7 <- c(7,8,9,10)
My code(thanks to Mrs.Richard Herron for inspiration):
x <- 1:10
n <- 3
vectors <- function(x, n) split(x, sort(rank(x) %% n))
vectors(x, n)
Thanks very much!
We can use lapply to loop over the indices 1:(length(x) - n), create a sequence (:) from each index i to i + n, and subset 'x' with it, giving windows of length 4. If individual vectors are needed, we set the names of the list and use list2env.
n <- 3
lst <- lapply(1:(length(x)-n), function(i) x[i:(i+n)])
names(lst) <- paste0("x", seq_along(lst))
list2env(lst, envir = .GlobalEnv)
x1
#[1] 1 2 3 4
x2
#[1] 2 3 4 5
x3
#[1] 3 4 5 6
Or we can also create a matrix instead of multiple vectors in the global environment where each row corresponds to the vector of interest
matrix(x[1:4] + rep(0:6, each = 4), ncol=4, byrow = TRUE)
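Another base R option, not used in the answer above, is embed(), which builds exactly these rolling windows; its columns come out in decreasing lag order, hence the reversal:
embed(x, n + 1)[, (n + 1):1]
#      [,1] [,2] [,3] [,4]
# [1,]    1    2    3    4
# [2,]    2    3    4    5
# ...
# [7,]    7    8    9   10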

Automate pairwise comparisons between subsets in a dataframe R

I have a data.frame with several variables X1, X2, ... and a grouping variable "site". I want to find the proportion of X1 values with site == 1 that are greater than the X1 values with site == 2. I can do that with a fixed number of site levels, one variable at a time, but I would like to generalize it to any number of levels and several variables. The following is an example:
# Generate data
set.seed(20130226)
n <- 100
x1 <- matrix(c(rnorm(n, mean = 2),rnorm(n, mean = 5)),ncol=2)
x2 <- matrix(c(rnorm(n, mean = 1), rnorm(n, mean = 4)),ncol=2)
x3 <- matrix(c(rnorm(n, mean = 3), rnorm(n, mean = 3)),ncol=2)
xx <- data.frame(x1,site=1)
xx <- rbind(xx, data.frame(x2,site=2))
xx <- rbind(xx, data.frame(x3,site=3))
# comparisons
s <- unique(xx$site)
me1 <- with(xx, xx[site == s[1], ])
me2 <- with(xx, xx[site == s[2], ])
me3 <- with(xx, xx[site == s[3], ])
Pg1.gt.g2 <- sum(me1[, c("X1")] > me2[, c("X1")]) / nrow(me1)
Pg1.gt.g3 <- sum(me1[, c("X1")] > me3[, c("X1")]) / nrow(me1)
Pg2.gt.g3 <- sum(me2[, c("X1")] > me3[, c("X1")]) / nrow(me1)
# build table
comp1 <- data.frame(Group=c(paste(s[1],">",s[2]),paste(s[1],">",s[3]),paste(s[2],">",s[3])), P=c(Pg1.gt.g2, Pg1.gt.g3,Pg2.gt.g3))
print(comp1)
I can't figure out how to do this for a different number of groups and several variables, maybe using plyr.
Thanks!
I would reshape the data into a matrix where each column represents a group:
# Unique sites
s <- unique(xx$site)
# Columns are each group, data are X1 values
mat <- do.call(cbind, lapply(split(xx, xx$site), function(x) x$X1))
# Compare all pairs of sites
do.call(rbind, apply(combn(seq_along(s), 2), 2,
  function(x) data.frame(g1 = s[x[1]], g2 = s[x[2]],
                         prop = sum(mat[, x[1]] > mat[, x[2]]) / nrow(mat))))
# g1 g2 prop
# 1 1 2 0.83
# 2 1 3 0.20
# 3 2 3 0.09
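To also cover the "several variables" part of the question, one possible extension (a sketch, not part of the original answer) repeats the same logic per variable name:
# Sketch: run the pairwise comparison for each variable of interest
vars <- c("X1", "X2")
setNames(lapply(vars, function(v) {
  mat <- do.call(cbind, lapply(split(xx, xx$site), function(d) d[[v]]))
  do.call(rbind, apply(combn(seq_along(s), 2), 2,
    function(x) data.frame(g1 = s[x[1]], g2 = s[x[2]],
                           prop = sum(mat[, x[1]] > mat[, x[2]]) / nrow(mat))))
}), vars)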

Nested for loop in R

I wrote the following code, and I need to repeat it 100 times. I know I need another for loop, but I don't know how to do it. Here is the code:
mean <- c(5, 5, 10, 10, 5, 5, 5)
x <- NULL
u <- NULL
delta1 <- NULL
w1 <- NULL
for (i in 1:7) {
  x[i] <- rexp(1, rate = mean[i])
  u[i] <- (1/1.2) * runif(1, min = 0, max = 1)
  y1 <- min(x, u)
  if (y1 == min(x)) {
    delta1 <- 1
  } else {
    delta1 <- 0
  }
  if (delta1 == 0) {
    w1 <- NULL
  } else {
    if (y1 == x[[1]]) {
      w1 <- "x1"
    }
  }
}
output <- cbind(delta1,w1)
output
I want the final output to be 100 rows* 3 columns matrix representing run number, delta1, and w1.
Any thoughts will be truly appreciated.
Here's what I gather you're trying to achieve from your code:
1. Given two sets of draws from different distributions (Exponential and Uniform),
2. find out which distribution the smallest number comes from,
3. and repeat this 100 times.
There are a couple of problems with your code if you want to achieve this, so here's a cleaned-up example:
rates <- c(5, 5, 10, 10, 5, 5, 5)  # 'mean' is an inbuilt function
# Initialise the output data frame:
output <- data.frame(number = rep(0, 100), delta1 = rep(1, 100), w1 = rep("x1", 100))
for (i in 1:100) {
  # Generating u doesn't require a for loop, and we can fold the
  # (1/1.2) factor into max = 5/6.
  u <- runif(7, min = 0, max = 5/6)
  # Generating x doesn't need a loop either. It's better to use apply functions
  # when you can!
  x <- sapply(rates, function(r) rexp(1, rate = r))
  y1 <- min(x, u)
  # Now we can store the output
  output[i, "number"] <- y1
  # Two things here:
  # 1) Use all.equal instead of == to compare floating point numbers.
  # 2) The data frame was initialised as if the minimum always came from x[1],
  #    so we overwrite delta1 and w1 where the minimum came from u, and w1
  #    where it came from an x other than x[1].
  if (isTRUE(all.equal(y1, min(u)))) {
    output[i, "delta1"] <- 0
    output[i, "w1"] <- NA  # Can't use NULL in a character vector.
  } else if (!isTRUE(all.equal(y1, x[1]))) {
    output[i, "w1"] <- NA  # Minimum came from x, but not from x[1].
  }
}
output
output
Here's an alternative, more efficient approach with replicate:
Mean <- c(5, 5, 10, 10, 5, 5, 5)
n <- 100 # number of runs
res <- t(replicate(n, {
  x <- rexp(n = length(Mean), rate = Mean)
  u <- runif(n = length(Mean), min = 0, max = 1/1.2)
  mx <- min(x)
  delta1 <- mx <= min(u)
  w1 <- delta1 & mx == x[1]
  c(delta1, w1)
}))
output <- data.frame(run = seq.int(n), delta1 = as.integer(res[ , 1]),
                     w1 = c(NA, "x1")[res[ , 2] + 1])
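The c(NA, "x1")[res[ , 2] + 1] step is just an indexing trick: res[ , 2] is 0 or 1, so adding 1 selects either the first element (NA) or the second ("x1").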
The result:
head(output)
# run delta1 w1
# 1 1 1 <NA>
# 2 2 1 <NA>
# 3 3 1 <NA>
# 4 4 1 x1
# 5 5 1 <NA>
# 6 6 0 <NA>
