which column is a duplicate column a duplicate of? - r

I have a data frame with possibly several duplicated columns.
I would like to do two things (the following is a reproducible example; in practice I have dozens of columns):
First, extract the unique columns. This can be accomplished efficiently with the following command:
library(dplyr)
df = data_frame(x = rnorm(100), y = rnorm(100), z = y, w = x)
df[!duplicated(as.list(df))]
This outputs columns x and y.
Next, I would like to find specifically which is the duplicate of x (w) and which is the duplicate of y (z). How can I do this efficiently?

Very similar to the answer provided by @Sotos:
library(dplyr)
# example data
df = data_frame(x = rnorm(100), y = rnorm(100), z = y, w = x)
# function to compare vectors
f = function(x,y) { all(df[,x] == df[,y]) }
# vectorise function
f = Vectorize(f)
data.frame(t(combn(names(df),2)), stringsAsFactors = F) %>%
mutate(flag = f(X1, X2))
# X1 X2 flag
# 1 x y FALSE
# 2 x z FALSE
# 3 x w TRUE
# 4 y z TRUE
# 5 y w FALSE
# 6 z w FALSE
You can then focus on the pairs where the flag is TRUE.
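To go from flagged pairs to a direct lookup of "which column is this a duplicate of", one option is to compare each duplicated column against the columns that were kept. This is a minimal base R sketch, not part of the answer above; the names dup_of and originals are hypothetical, and it assumes exact duplicates as tested by identical().

```r
set.seed(1)
df <- data.frame(x = rnorm(100), y = rnorm(100))
df$z <- df$y   # duplicate of y
df$w <- df$x   # duplicate of x

dup <- duplicated(as.list(df))      # TRUE for columns already seen
originals <- names(df)[!dup]        # columns that are kept

# for each duplicated column, name the first kept column identical to it
dup_of <- vapply(names(df)[dup], function(d) {
  hit <- vapply(originals, function(o) identical(df[[o]], df[[d]]), logical(1))
  originals[which(hit)[1]]
}, character(1))

dup_of
```

The result is a named character vector: the names are the duplicated columns and the values are the columns they duplicate.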

You could use combn to get all pairwise combinations of your columns. Once you have those, there are various metrics to catch equality. Since you are dealing with floats, I would suggest using correlation. If it is 1, the two columns are the same up to a positive rescaling and shift, i.e.
setNames(combn(df, 2, FUN = function(i) all(cor(i) == 1)),
combn(names(df), 2, FUN = toString))
# x, y x, z x, w y, z y, w z, w
#FALSE FALSE TRUE TRUE FALSE FALSE
If correlation alone is not enough (a scaled copy such as 2 * x also has correlation 1), we can require equal variances as well, i.e.
setNames(combn(df, 2, FUN = function(i) all(cor(i) == 1) & length(unique(as.vector(var(i)))) == 1),
combn(names(df), 2, toString))
# x, y x, z x, w y, z y, w z, w
#FALSE FALSE TRUE TRUE FALSE FALSE
You can structure the output any way you want.
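If the concern with floats is rounding noise rather than scaling, another option is a tolerance-based comparison with all.equal(). This is only a sketch under that assumption; the column w with added noise of 1e-12 is a hypothetical example, and the default tolerance of all.equal() may not suit every data set.

```r
set.seed(42)
df <- data.frame(x = rnorm(100), y = rnorm(100))
df$z <- df$y
df$w <- df$x + 1e-12   # hypothetical: a copy of x with tiny rounding noise

pairs <- t(combn(names(df), 2))   # one row per pair of column names

# isTRUE() is needed because all.equal() returns a message, not FALSE, on mismatch
same <- apply(pairs, 1, function(p) isTRUE(all.equal(df[[p[1]]], df[[p[2]]])))

res <- data.frame(X1 = pairs[, 1], X2 = pairs[, 2], same = same)
res[res$same, ]
```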

Related

Thousand separator to numeric columns in R

I am trying to format numbers as shown below (adding a thousand separator). The function works fine, but after formatting, the numeric column no longer sorts numerically because the values are now character strings.
df <- data.frame(x = c(12345,35666,345,5646575))
format_numbers <- function (df, column_name){
df[[column_name]] <- ifelse(nchar(df[[column_name]]) <= 5, paste(format(round(df[[column_name]] / 1e3, 1), trim = TRUE), "K"),
paste(format(round(df[[column_name]] / 1e6, 1), trim = TRUE), "M"))
}
df$x <- format_numbers(df,"x")
> df
x
1 12.3 K
2 35.7 K
3 0.3 K
4 5.6 M
Can we make sure the numbers are sorted in descending/ascending order post formatting ?
Note : This data df is to be incorporated in DT table
The problem is the formatting step. If you format while keeping the underlying data numeric, everything else falls into place. Here I demonstrate this with an S3 class:
my_numbers <- function(x) structure(x, class = c('my_numbers', 'numeric'))
format.my_numbers <- function(x,..., d = 1, L = c('', 'K', 'M', 'B', 'T')){
ifelse(abs(x) >= 1000, Recall(x/1000, d = d + 1),
sprintf('%.1f%s', x, L[d]))
}
print.my_numbers <- function(x, ...) print(format(x), quote = FALSE)
'[.my_numbers' <- function(x, ..., drop = FALSE) my_numbers(NextMethod('['))
Now you can run your code:
df <- data.frame(x = c(12345,35666,345,5646575))
df$x <- my_numbers(df$x)
df
x
1 12.3K
2 35.7K
3 345.0
4 5.6M
Column x is still numeric, so you can use any mathematical operation on it.
For example, cbind it with its double and order the rows from smallest to largest:
cbind(x = df, y = df*2)[order(df$x),]
x x
3 345.0 690.0 # smallest
1 12.3K 24.7K
2 35.7K 71.3K
4 5.6M 11.3M # largest ie Millions
Note that under the hood, x does not change:
unclass(df$x)
[1] 12345 35666 345 5646575 # Same as given
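If a custom class is more than you need, a simpler variant of the same idea is to keep the numeric column for sorting and add a separate character column for display. This is a sketch under that assumption; abbreviate_number and label are hypothetical names, and values that round up to 1000 of a unit (such as 999950) are not handled specially.

```r
abbreviate_number <- function(x, digits = 1) {
  units <- c("", "K", "M", "B", "T")
  # unit index: 1 for |x| < 1e3, 2 for thousands, 3 for millions, ...
  idx <- pmin(floor(log10(pmax(abs(x), 1)) / 3) + 1, length(units))
  paste0(formatC(x / 1000^(idx - 1), format = "f", digits = digits), units[idx])
}

df <- data.frame(x = c(12345, 35666, 345, 5646575))
df$label <- abbreviate_number(df$x)   # display only; x stays numeric

# sort on the numeric column, show the label
df[order(df$x, decreasing = TRUE), ]
```

Sorting always happens on x, so the order is numeric regardless of what the labels look like.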

How to apply a custom function to every value in a dataframe?

I am trying to apply a custom function to every value of a dataframe. Here is the custom function and dataframe:
#function
test_fun <- function(x, y = 1) {
output <- x + y
output
}
#dataframe
df <- data.frame(a = c(1,2,3), b = c(4,5,6))
Now let's say I want to apply test_fun, with y = 2, to every value of df. This method doesn't seem to work:
lapply(df, test_fun(y = 2))
The function is vectorized, so we can apply it directly to the whole dataset:
test_fun(df, y = 2)
# a b
#1 3 6
#2 4 7
#3 5 8
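Regarding the OP's error: test_fun(y = 2) calls the function immediately, with x missing, instead of passing the function to lapply. If we are not using a lambda function, pass the extra argument to lapply directly: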
lapply(df, test_fun, y = 2)
-output
#$a
#[1] 3 4 5
#$b
#[1] 6 7 8
Or wrap the call in a lambda function and set y = 2 inside it:
lapply(df, function(vec) test_fun(vec, y = 2))
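Both lapply forms return a list. If the result should stay a data frame, one common idiom is to assign into df[] so that the existing structure is kept. A short sketch, using a copy named out (a hypothetical name) so the original df is left untouched:

```r
test_fun <- function(x, y = 1) {
  x + y
}

df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

out <- df
# out[] keeps the data frame shell and replaces only the columns
out[] <- lapply(df, test_fun, y = 2)
out
```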

Find variables that occur only in one cluster in data.frame in R

Using BASE R, I wonder how to answer the following question:
Is there any value of X or Y (i.e., the variables of interest named in mods) that occurs in only one element of m (a cluster) and in no other? If yes, produce my desired output below.
For example:
Here we see X == 3 only occurs in element m[[3]] but not m[[1]] and m[[2]].
Here we also see that Y == 99 occurs only in m[[1]] and not in the others.
Note: the following is a toy example, so a functional answer is appreciated. Also, X and Y may or may not be numeric (e.g., they may be strings).
f <- data.frame(id = c(rep("AA",4), rep("BB",2), rep("CC",2)), X = c(1,1,1,1,1,1,3,3),
Y = c(99,99,99,99,6,6,6,6))
m <- split(f, f$id) # Here is `m`
mods <- names(f)[-1] # variables of interest names
Desired output:
list(AA = c(Y = 99), CC = c(X = 3))
# $AA
# Y
# 99
# $CC
# X
# 3
This is a solution based on rapply() and table().
ux <- rapply(m, unique)
tb <- table(uxm <- ux[gsub(rx <- "^.*\\.(.*)$", "\\1", names(ux)) %in% mods])
r <- Map(setNames, n <- uxm[uxm %in% names(tb)[tb == 1]], gsub(rx, "\\1", names(n)))
setNames(r, gsub("^(.*)\\..*$", "\\1", names(r)))
# $AA
# Y
# 99
#
# $CC
# X
# 3
tmp = do.call(rbind, lapply(names(f)[-1], function(x){
d = unique(f[c("id", x)])
names(d) = c("id", "val")
transform(d, nm = x)
}))
tmp = tmp[ave(as.numeric(as.factor(tmp$val)), tmp$val, FUN = length) == 1,]
lapply(split(tmp, tmp$id), function(a){
setNames(a$val, a$nm)
})
#$AA
# Y
#99
#$BB
#named numeric(0)
#$CC
#X
#3
This utilizes @jay.sf's idea of rapply() with an idea from a previous answer:
vec <- rapply(lapply(m, '[', , mods), unique)
unique_vec <- vec[!duplicated(vec) & !duplicated(vec, fromLast = T)]
vec_names <- do.call(rbind, strsplit(names(unique_vec), '.', fixed = T))
names(unique_vec) <- vec_names[, 2]
split(unique_vec, vec_names[, 1])
$AA
Y
99
$CC
X
3
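Since a functional answer was requested, the same logic can also be wrapped in a function that counts clusters per value separately for each variable, so that a value of X is never pooled with an equal value of Y. This is a base R sketch rather than any of the answers above; only_in_one is a hypothetical name, and if the variables have different types, rbind will coerce the values to a common type.

```r
f <- data.frame(id = c(rep("AA", 4), rep("BB", 2), rep("CC", 2)),
                X = c(1, 1, 1, 1, 1, 1, 3, 3),
                Y = c(99, 99, 99, 99, 6, 6, 6, 6))
mods <- names(f)[-1]

only_in_one <- function(data, vars, id = "id") {
  res <- lapply(vars, function(v) {
    d <- unique(data[c(id, v)])   # one row per (cluster, value) pair
    n <- table(d[[v]])            # number of clusters each value appears in
    keep <- d[as.character(d[[v]]) %in% names(n)[n == 1], ]
    data.frame(id = as.character(keep[[id]]), var = rep(v, nrow(keep)),
               val = keep[[v]], stringsAsFactors = FALSE)
  })
  res <- do.call(rbind, res)
  # one named vector per cluster that has at least one exclusive value
  lapply(split(res, res$id), function(a) setNames(a$val, a$var))
}

only_in_one(f, mods)
```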

How to find elements of one vector that aren't in another (not using setdiff)

I have two vectors,
x <- c(1,2,2,3,4)
y <- c(1,2,3)
And I want to get another vector of the elements that are in x that aren't in y; so in this case (2,4).
I've tried using the setdiff() function but this doesn't take into account duplicates (it would return only 4), so I'm not sure how to go about this.
Thank you!
Maybe try this:
x[-match(y,x,nomatch = 0)]
The nomatch = 0 is necessary to avoid mixing NAs with negative subscripts.
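One caveat with this idiom: if no element of y occurs in x, every index is 0, and indexing with only zeros returns an empty vector instead of x. A small sketch of the pitfall and one possible guard, using setdiff() on positions rather than on values (the names idx and keep are hypothetical):

```r
x <- c(1, 2, 2, 3, 4)

# no element of y is in x: all indices are 0, so the result is empty
y <- c(7, 8)
idx <- match(y, x, nomatch = 0)
empty <- x[-idx]

# guard: drop the matched positions from the full set of positions
keep <- setdiff(seq_along(x), idx)
safe <- x[keep]

# the guard gives the same answer as before on the original example
idx2 <- match(c(1, 2, 3), x, nomatch = 0)
orig <- x[setdiff(seq_along(x), idx2)]
```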
To deal with additional duplicates, as mentioned in the comments, another option might be to use vsetdiff from the package vecsets:
library(vecsets)
x = c(1, 2, 2, 3, 3, 4)
y = c(1, 2, 2, 3)
> vsetdiff(x,y)
[1] 3 4
It won't give the results discussed by @Gregor; however, it should give the correct results for the original example:
x[duplicated(x) | !x %in% y]
[1] 2 4
In individual steps:
duplicated(x)
[1] FALSE FALSE TRUE FALSE FALSE
!x %in% y
[1] FALSE FALSE FALSE FALSE TRUE
duplicated(x) | !x %in% y
[1] FALSE FALSE TRUE FALSE TRUE
Considering the OP's original example and @Gregor's comment, I wrote the following function, which does what the OP wants and also accounts for what @Gregor pointed out:
## function to find values in x that are absent in y
x.not.in.y <- function(x, y) {
# get freq tables for x and y
x.tab <- table(x)
y.tab <- table(y)
# if a value is missing in y then set its freq to zero
y.tab[setdiff(names(x.tab), names(y.tab))] = 0
y.tab <- y.tab[names(y.tab) %in% names(x.tab)]
# get the difference of x and y freq and keep if > 0
diff.tab <- x.tab[order(names(x.tab))] - y.tab[order(names(y.tab))]
diff.tab <- diff.tab[diff.tab > 0]
# output vector of x values missing in y
unlist(
lapply(names(diff.tab), function(val) {
rep(as.numeric(val), diff.tab[val])
}),
use.names = F)
}
# OP's original data
x.not.in.y(x = c(1,2,2,3,4), y = c(1,2,3))
#> [1] 2 4
# @Gregor's data
x.not.in.y(x = c(1,2,2,3,3,4), y = c(1,2,2,3))
#> [1] 3 4
# some other data with an extra value in y that is absent in x
x.not.in.y(x = c(1,2,2,2,2,3,3,3,4,5), y = c(1,2,3,6))
#> [1] 2 2 2 3 3 4 5
Created on 2019-04-15 by the reprex package (v0.2.1)
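A more compact way to get the same multiset difference is to label each element with its occurrence number and then compare the labels. This is a sketch of that idea, not the function above; multiset_diff and occ are hypothetical names, and the labels are built with paste(), so it relies on the default character conversion of the values.

```r
multiset_diff <- function(x, y) {
  # occurrence number of each element, e.g. c(2, 2, 3) -> 1, 2, 1
  occ <- function(v) ave(seq_along(v), v, FUN = seq_along)
  kx <- paste(x, occ(x))   # e.g. "2 1", "2 2"
  ky <- paste(y, occ(y))
  # keep the elements of x whose (value, occurrence) label is not in y
  x[!kx %in% ky]
}

multiset_diff(c(1, 2, 2, 3, 4), c(1, 2, 3))
multiset_diff(c(1, 2, 2, 3, 3, 4), c(1, 2, 2, 3))
multiset_diff(c(1, 2, 2, 2, 2, 3, 3, 3, 4, 5), c(1, 2, 3, 6))
```

Unlike the table-based function, this keeps the elements in their original order in x and does not convert them through as.numeric().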

Minimum distance between elements in two logical vectors

I have two logical vectors, x and y, and a vector of weights, z, with one value per index. For each x value that is TRUE, I'd like to find the nearest index at which y is also TRUE, then take the sum of z between those two indices (inclusive). If two y indices are equally near, the smaller of the two sums is used.
x y z
1 FALSE TRUE 0.05647057
2 FALSE FALSE 0.09577802
3 TRUE FALSE 0.04150954
4 FALSE FALSE 0.07242995
5 FALSE TRUE 0.06220041
6 FALSE FALSE 0.01861535
7 FALSE FALSE 0.05056971
8 TRUE FALSE 0.07726933
9 FALSE TRUE 0.04669694
10 TRUE TRUE 0.02312497
There are 3 x values that are TRUE, so we'll call them {x_1, x_2, x_3}. Here I demonstrate summing z between each x_i and its nearest y_i neighbor. What is the most efficient base R way to accomplish this? I have a method at the end that uses two nested apply-style loops, which tells me it's probably not efficient. I don't have a math background, and usually there's some vectorized algebraic way to accomplish these sorts of tasks instead of brute computational power.
## x_1
sum(z[3:5]) ## This one is smaller so use it
sum(z[1:3])
## x_2
sum(z[8:9])
## x_3
sum(z[10])
c(sum(z[3:5]), sum(z[8:9]), sum(z[10]))
[1] 0.17613990 0.12396627 0.02312497
MWE:
x <- y <- rep(FALSE, 10)
x[c(3, 8, 10)] <- TRUE
y[c(1, 5, 9, 10)] <- TRUE
set.seed(15)
z <- rnorm(10, .5, .25)/10
data.frame(x=x, y=y, z=z)
Here is an approach that is less than optimal:
dat <- data.frame(x=x, y=y, z=z)
sapply(which(dat[, "x"]), function(x) {
ylocs <- which(dat[, "y"])
dists <- abs(x - ylocs)
min.ylocs <- ylocs[min(dists) == dists]
min(sapply(min.ylocs, function(y, x2 = x) {
sum(dat[, "z"][x2:y])
}))
})
## [1] 0.17613990 0.12396627 0.02312497
I'd prefer to keep the solution within base.
This uses no loops or apply functions. We use na.locf from zoo to carry the index of the last TRUE y forward, giving fwd, and the index of the next TRUE y backward, giving bck. Finally, we take the smaller of the two corresponding sums. This depends on na.locf in the zoo package, but at the end we extract the core code from zoo to avoid the dependency:
library(zoo) # na.locf
x <- dat$x
y <- dat$y
z <- dat$z
yy <- ifelse(y, TRUE, NA) * seq_along(y)
fwd <- na.locf(yy, fromLast = FALSE)[x]
bck <- na.locf(yy, fromLast = TRUE)[x]
cs <- cumsum(z)
pmin(cs[x] - cs[fwd] + z[fwd], cs[bck] - cs[x] + z[x])
The last line gives:
[1] 0.17613990 0.12396627 0.02312497
Here is a mini version of na.locf. The library call above could be replaced with this.
# code extracted from zoo package
na.locf <- function(x, fromLast = FALSE) {
L <- !is.na(x)
if (fromLast) rev(c(NA, rev(which(L)))[cumsum(rev(L)) + 1])
else c(NA, which(L))[cumsum(L)+1L]
}
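The pmin() step above takes the smaller of the two directional sums regardless of which TRUE y is nearer. If the nearest TRUE y should be chosen first, with the smaller sum used only on ties as the question describes, a base R variant with findInterval() is one possibility. This is a sketch under that reading of the question, not the answer's method; the names prev, nxt, s_prev and s_next are hypothetical.

```r
x <- y <- rep(FALSE, 10)
x[c(3, 8, 10)] <- TRUE
y[c(1, 5, 9, 10)] <- TRUE
set.seed(15)
z <- rnorm(10, .5, .25) / 10

xi <- which(x)
yi <- which(y)
cs <- c(0, cumsum(z))   # cs[k + 1] = sum(z[1:k]), so sum(z[a:b]) = cs[b + 1] - cs[a]

# nearest TRUE y at or before each x (NA if none), and at or after (NA if none)
prev <- c(NA, yi)[findInterval(xi, yi) + 1]
nxt <- yi[findInterval(xi - 1, yi) + 1]

s_prev <- cs[xi + 1] - cs[prev]   # sum(z[prev:xi])
s_next <- cs[nxt + 1] - cs[xi]    # sum(z[xi:nxt])
d_prev <- xi - prev
d_next <- nxt - xi

# nearer side wins; on a tie take the smaller sum
res <- ifelse(is.na(d_next) | (!is.na(d_prev) & d_prev < d_next), s_prev,
       ifelse(is.na(d_prev) | d_next < d_prev, s_next, pmin(s_prev, s_next)))
res
```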
