R: Test for overlap of name values in dataframe - r

I have a dataframe filled with names.
For a given row in the dataframe, I'd like to compare that row to every row above it in the df and determine if the number of matching names is less than or equal to 4 for every row.
Toy Example where row 3 is the row of interest
"Jim","Dwight","Michael","Andy","Stanley","Creed"
"Jim","Dwight","Angela","Pam","Ryan","Jan"
"Jim","Dwight","Angela","Pam","Creed","Ryan" <--- row of interest
So first we'd compare row 3 to row 1 and see that the name overlap is 3, which meets the <= 4 criterion.
Then we'd compare row 3 to row 2 and see that the name overlap is 5, which fails the <= 4 criterion, so the check of being <= 4 for every row above it ultimately fails.
Right now I am doing this operation using a for loop but the speed is much too slow for the dataframe size I am working with.

Example data
df <- as.data.frame(rbind(
c("Jim","Dwight","Michael","Andy","Stanley","Creed"),
c("Jim","Dwight","Angela","Pam","Ryan","Jan"),
c("Jim","Dwight","Angela","Pam","Creed","Ryan")
), stringsAsFactors = FALSE)
df
# V1 V2 V3 V4 V5 V6
# 1 Jim Dwight Michael Andy Stanley Creed
# 2 Jim Dwight Angela Pam Ryan Jan
# 3 Jim Dwight Angela Pam Creed Ryan
Operation and output (sapply over columns with %in% and take rowSums)
out_lgl <- rowSums(sapply(df, '%in%', unlist(df[3,]))) <= 4
out_lgl
# [1] TRUE FALSE FALSE
which(out_lgl)
# [1] 1
Explanation:
For each column, each element is compared to the third row (the vector unlist(df[3,])). The output is a matrix of logical values with the same dimensions as df, TRUE if there is a match.
sapply(df, '%in%', unlist(df[3,]))
# V1 V2 V3 V4 V5 V6
# [1,] TRUE TRUE FALSE FALSE FALSE TRUE
# [2,] TRUE TRUE TRUE TRUE TRUE FALSE
# [3,] TRUE TRUE TRUE TRUE TRUE TRUE
Then we can sum the TRUEs to see the number of matches for each row
rowSums(sapply(df, '%in%', unlist(df[3,])))
# [1] 3 5 6
Edit:
I have added the stringsAsFactors = FALSE option to the creation of df above. However, as far as I can tell the output of %in% is the same whether comparing factors with different levels or characters, so I don't believe this could change the results in any way. See example below
x <- c('b', 'c', 'z')
y <- c('a', 'b', 'g')
all.equal(x %in% y, factor(x) %in% factor(y))
# [1] TRUE

Similar solution as IceCreamToucan, but for any row.
For the data.frame:
df <- as.data.frame(rbind(
c("Jim","Dwight","Michael","Andy","Stanley","Creed"),
c("Jim","Dwight","Angela","Pam","Ryan","Jan"),
c("Jim","Dwight","Angela","Pam","Creed","Ryan")
))
For any row number i:
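f <- function(i) {
  # The first row has no rows above it, so the condition holds trivially
  if (i == 1) return(TRUE)
  # For each column, flag entries in rows 1..(i-1) that also appear in row i
  r <- vapply(df[1:(i-1), ], '%in%', unlist(df[i, ]), FUN.VALUE = logical(i-1))
  # vapply returns a plain vector when i == 2, so force one matrix row per compared row
  out_lgl <- rowSums(matrix(r, nrow = i - 1)) <= 4
  return(all(out_lgl))
}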
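To run the check for every row at once, the per-row test can be applied over all row indices. The following is only a minimal sketch, assuming the names are first converted to a character matrix; the helper name passes_above and its max_overlap argument are illustrative and do not come from the answers above.

```r
df <- as.data.frame(rbind(
  c("Jim","Dwight","Michael","Andy","Stanley","Creed"),
  c("Jim","Dwight","Angela","Pam","Ryan","Jan"),
  c("Jim","Dwight","Angela","Pam","Creed","Ryan")
), stringsAsFactors = FALSE)

m <- as.matrix(df)  # character matrix, one row of names per data frame row

# Hypothetical helper: TRUE if row i shares at most max_overlap names
# with each of the rows above it
passes_above <- function(i, max_overlap = 4) {
  if (i == 1) return(TRUE)  # no rows above the first row
  above <- m[seq_len(i - 1), , drop = FALSE]
  # %in% flattens the matrix column-wise, so restore its shape before summing
  hits <- matrix(above %in% m[i, ], nrow = nrow(above))
  all(rowSums(hits) <= max_overlap)
}

res <- vapply(seq_len(nrow(m)), passes_above, logical(1))
res
# [1]  TRUE  TRUE FALSE
```

Row 2 shares only two names with row 1, so it passes; row 3 fails because of its overlap of 5 with row 2.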

Related

R: make 2 subset vectors so that values are different index-wise

I want to make 2 vectors subsetting from the same data, with replace=TRUE.
Even if both vectors can contain the same values, they cannot be the same at the same index position.
For example:
> set.seed(1)
> a <- sample(15, 10, replace=T)
> b <- sample(15, 10, replace=T)
> a
[1] 4 6 9 14 4 14 15 10 10 1
> b
[1] 4 3 11 6 12 8 11 15 6 12
> a==b
[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
In this case, vectors a and b contain the same value at index 1 (value==4), which is wrong for my purposes.
Is there an easy way to correct this?
And can it be done on the subset step?
Or should I go through a loop checking element by element and if the values are identical, make another selection for b[i] and check again if it's not identical ad infinitum?
many thanks!
My idea is, instead of getting 2 samples of length 10 with replacement, get 10 samples of length 2 without replacement
library(purrr)
l <- rerun(10,sample(15,2,replace=FALSE))
Each element in l is a vector of integers of length two. Those two integers are guaranteed to be different because we specified replace=FALSE in sample.
# extract the first element of each pair in l; this is a
a <- map_int(l,`[[`,1)
# extract the second element of each pair in l; this is b
b <- map_int(l,`[[`,2)
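The same idea can probably be written in base R, which may help if purrr is not available (rerun() was deprecated in purrr 1.0.0). This is a sketch under that assumption; the variable name pairs is illustrative.

```r
set.seed(1)

# Draw 10 pairs without replacement; replicate() returns a 2 x 10 matrix,
# one pair per column
pairs <- replicate(10, sample(15, 2, replace = FALSE))

a <- pairs[1, ]  # first element of every pair
b <- pairs[2, ]  # second element of every pair

all(a != b)
# [1] TRUE
```

Values can still repeat within a and within b, because each pair is drawn independently of the others.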
How about a two-stage sampling process
set.seed(1)
x <- 1:15
a <- sample(x, 10, replace = TRUE)
b <- sapply(a, function(v) sample(x[x != v], 1))
a != b
#[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
We first draw samples a; then for every sample from a, we draw a new sample from the set of values x excluding the current sample from a. Since we're doing this one-sample-at-a-time, we automatically allow for sampling with replacement.
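If the sapply call becomes a bottleneck for long vectors, the second stage could be vectorised when the values are exactly 1..n. This is a sketch of an alternative technique (a random non-zero offset, wrapped around with modular arithmetic), not the code from the answer above; the names n and offset are assumptions made for illustration.

```r
set.seed(1)
n <- 15

a <- sample(n, 10, replace = TRUE)

# An offset between 1 and n - 1 can never map a value back onto itself
offset <- sample(n - 1, 10, replace = TRUE)

# Shift a by the offset and wrap around so the result stays within 1..n
b <- (a - 1 + offset) %% n + 1

all(a != b)
# [1] TRUE
```

Given a[i], each of the other n - 1 values is equally likely for b[i], which matches what sampling from x[x != v] does.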

Compare 2 dataframes for equality in R

I have 2 dataframes with the same 2 columns. I want to check whether the datasets are identical. The original datasets have some 700K records, but I'm trying to figure out a way to do it using dummy datasets.
I tried using compare, identical, all, all_equal etc. None of them returns TRUE.
The dummy datasets are -
a <- data.frame(x = 1:10, b = 20:11)
c <- data.frame(x = 10:1, b = 11:20)
all(a==c)
[1] FALSE
compare(a,c)
FALSE [FALSE, FALSE]
identical(a,c)
[1] FALSE
all.equal(a,c)
[1] "Component “x”: Mean relative difference: 0.9090909" "Component “b”: Mean relative difference: 0.3225806"
The datasets are exactly the same, except for the order of the records. If these functions only work when the datasets are mirror images of each other, then I must try something else. If that is the case, can someone help me get TRUE for these 2 (unordered) datasets?
dplyr's setdiff works on data frames, I would suggest
library(dplyr)
nrow(setdiff(a, c)) == 0 & nrow(setdiff(c, a)) == 0
# [1] TRUE
Note that this will not account for number of duplicate rows. (i.e., if a has multiple copies of a row, and c has only one copy of that row, it will still return TRUE). Not sure how you want duplicate rows handled...
If you do care about having the same number of duplicates, then I would suggest two possibilities: (a) adding an ID column to differentiate the duplicates and using the approach above, or (b) sorting, resetting the row names (annoyingly), and using identical.
(a) adding an ID column
library(dplyr)
a_id = group_by_all(a) %>% mutate(id = row_number())
c_id = group_by_all(c) %>% mutate(id = row_number())
nrow(setdiff(a_id, c_id)) == 0 & nrow(setdiff(c_id, a_id)) == 0
# [1] TRUE
(b) sorting
a_sort = a[do.call(order, a), ]
row.names(a_sort) = NULL
c_sort = c[do.call(order, c), ]
row.names(c_sort) = NULL
identical(a_sort, c_sort)
# [1] TRUE
Maybe a function to sort the columns before comparison is what you need. But it will be slow on large dataframes.
unordered_equal <- function(X, Y, exact = FALSE){
X[] <- lapply(X, sort)
Y[] <- lapply(Y, sort)
if(exact) identical(X, Y) else all.equal(X, Y)
}
unordered_equal(a, c)
#[1] TRUE
unordered_equal(a, c, TRUE)
#[1] TRUE
a$x <- a$x + .Machine$double.eps
unordered_equal(a, c)
#[1] TRUE
unordered_equal(a, c, TRUE)
#[1] FALSE
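One caveat with sorting each column on its own: it breaks the pairing of values within a row, so two data frames with different rows but the same per-column values also compare as equal. The sketch below illustrates this with two small made-up data frames (p and q) and contrasts it with sorting whole rows; the helper names sort_cols and sort_rows are illustrative.

```r
# Same values in every column, but the rows are paired differently
p <- data.frame(x = c(1, 2), b = c(1, 2))
q <- data.frame(x = c(1, 2), b = c(2, 1))

# Sort every column independently (what unordered_equal does)
sort_cols <- function(d) {
  d[] <- lapply(d, sort)
  d
}

# Sort whole rows by all columns and reset the row names
sort_rows <- function(d) {
  d <- d[do.call(order, d), , drop = FALSE]
  row.names(d) <- NULL
  d
}

col_result <- isTRUE(all.equal(sort_cols(p), sort_cols(q)))
row_result <- isTRUE(all.equal(sort_rows(p), sort_rows(q)))

col_result
# [1] TRUE
row_result
# [1] FALSE
```

If the rows themselves need to match, sorting whole rows (as in option (b) above) is the safer comparison.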
Basically what you want may be to compare the ordered underlying matrices.
all.equal(matrix(unlist(a[order(a[1]), ]), dim(a)),
matrix(unlist(c[order(c[1]), ]), dim(c)))
# [1] TRUE
identical(matrix(unlist(a[order(a[1]), ]), dim(a)),
matrix(unlist(c[order(c[1]), ]), dim(c)))
# [1] TRUE
You could wrap this into a function for more convenience:
om <- function(d) matrix(unlist(d[order(d[1]), ]), dim(d))
all.equal(om(a), om(c))
# [1] TRUE
You can use the new package called waldo
library(waldo)
a <- data.frame(x = 1:10, b = 20:11)
c <- data.frame(x = 10:1, b = 11:20)
compare(a,c)
And you get:
`old$x`: 1 2 3 4 5 6 7 8 9 10 and 9 more...
`new$x`: 10 ...
`old$b`: 20 19 18 17 16 15 14 13 12 11 and 9 more...
`new$b`:

R Programming : Logical dataframe to actual dataframe

I need to convert or manipulate records based on a logical dataframe in R.
I want to match it against the original dataframe and keep only the values where the logical dataframe is true, put null where it is false, and maintain the dataframe structure. Please suggest an approach.
For example:
Original dataframe
ID Name Title
1 John Mr
2 Mike Mr
3 Susan Dr
Logical Dataframe
ID Name Title
False False False
False True False
False False True
Expected Dataframe
ID Name Title
2 Mike <null>
3 <null> Dr
Here's a shot:
orig <- read.table(text="ID Name Title
1 John Mr
2 Mike Mr
3 Susan Dr", header = TRUE, stringsAsFactors = FALSE)
lgl <- read.table(text="ID Name Title
False False False
False True False
False False True", header = TRUE, stringsAsFactors = FALSE)
newdf <- mapply(function(d,l) { d[!l] <- NA; d; }, orig, lgl)
newdf
# ID Name Title
# [1,] NA NA NA
# [2,] NA "Mike" NA
# [3,] NA NA "Dr"
newdf[ rowSums(!is.na(newdf)) > 0, ]
# ID Name Title
# [1,] NA "Mike" NA
# [2,] NA NA "Dr"
Your expected output is inconsistent in that you have FALSE in your $ID column, but you keep them in your output. You can fix that by changing those to TRUE and changing the filter to rowSums(!is.na(newdf)) > 1.
Explanation:
mapply runs a function (named or anonymous) on one or more lists, like a "zipper" function. That is:
mapply(func, 1:3, 4:6, 7:9, SIMPLIFY=FALSE)
is equivalent to
list(func(1,4,7), func(2,5,8), func(3,6,9))
!is.na(newdf) creates a logical matrix of the same dimensions/names (newdf is a matrix here, because mapply simplified its result).
Since in general sum(<logical_vector>) returns a single integer of how many elements are true, rowSums(...) returns a vector, one element per row, where each element is the number of "trues" on that row.
... > 0 returns a logical vector, only passing the rows that have at least one non-NA element.
You said you wanted to always preserve $ID. In that case, you probably want to do (before processing):
lgl$ID <- TRUE
and change the condition to ... > 1 to mean "at least two non-NA elements, one of which we know is ID".
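Because mapply simplifies to a character matrix, the column types of the original data are lost. If the result should remain a data.frame with ID preserved, something along these lines might work. It is a sketch that rebuilds the example data with data.frame() rather than read.table(); the names masked and result are illustrative.

```r
orig <- data.frame(ID = 1:3,
                   Name = c("John", "Mike", "Susan"),
                   Title = c("Mr", "Mr", "Dr"),
                   stringsAsFactors = FALSE)
lgl <- data.frame(ID = c(FALSE, FALSE, FALSE),
                  Name = c(FALSE, TRUE, FALSE),
                  Title = c(FALSE, FALSE, TRUE))

lgl$ID <- TRUE  # always keep the ID column

# Blank out every cell whose flag is FALSE, keeping the data.frame structure
masked <- orig
masked[!as.matrix(lgl)] <- NA

# Keep rows with at least two non-NA cells (ID plus at least one other value)
result <- masked[rowSums(!is.na(masked)) > 1, ]
result
```

Here result keeps ID as an integer column and contains only the rows for IDs 2 and 3.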

Identifying duplicates rows only with respect to some columns

I would like to create a variable (e.g. reap) taking the value TRUE only if the elements of some columns are duplicates of those of another row BUT the values on other columns are different.
The sample data will probably clarify my question:
V1 V2 V3
1. a b c
2. a b d
3. e f g
4. e f g
For example, if we want a variable taking value TRUE when rows have same V1 and V2 but different V3, then this variable should look like the following:
V1 V2 V3 reap
1. a b c TRUE
2. a b d TRUE
3. e f g FALSE
4. e f g FALSE
Thanks a lot for your help.
One idea would be to identify all duplicates of each column, and create a logical vector using rowSums and setting the condition != ncol(df)
rowSums(sapply(df, function(i) duplicated(i)|duplicated(i, fromLast = TRUE))) != ncol(df)
#[1] TRUE TRUE FALSE FALSE
To only consider the third column
m1 <- sapply(df, function(i) duplicated(i)|duplicated(i, fromLast = TRUE))
rowSums(m1) == 2 & !m1[,3]
#[1] TRUE TRUE FALSE FALSE
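The column-by-column check above looks at each column separately, so it could be misled when a V1 value and a V2 value each repeat, but on different rows. A possible alternative, sketched here on the sample data from the question, tests duplication on the key columns jointly; the names key, dup_key and dup_all are illustrative.

```r
df <- data.frame(V1 = c("a", "a", "e", "e"),
                 V2 = c("b", "b", "f", "f"),
                 V3 = c("c", "d", "g", "g"),
                 stringsAsFactors = FALSE)

key <- df[c("V1", "V2")]

# Rows whose (V1, V2) pair occurs more than once
dup_key <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Rows that are repeated in full, V3 included
dup_all <- duplicated(df) | duplicated(df, fromLast = TRUE)

# Same key as another row, but not an exact copy of any row
df$reap <- dup_key & !dup_all
df$reap
# [1]  TRUE  TRUE FALSE FALSE
```

Note that a row which is an exact copy of another row gets FALSE here even if a third row shares its key with a different V3, so the rule may need adjusting for that case.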

Unexpected behavior in subsetting aggregate function in R

I have a data frame with the following format:
manufacturer pricegroup leads
harley <2500 #
honda <5000 #
... ... ..
I am using the aggregate function to pull out data in the following way:
aggregate( leads ~ manufacturer + pricegroup, data=leaddata,
FUN=sum, subset=(manufacturer==c("honda","harley")))
I noticed this is not returning the correct totals. The numbers for each manufacturer get smaller and smaller the more manufacturers I add to the subset group. However, if I use:
aggregate( leads ~ manufacturer + pricegroup, data=leaddata,
FUN=sum, subset=(manufacturer=="honda" | manufacturer=="harley"))
It returns the correct numbers. For the life of me, I can't figure out why. I would just use the OR operator, except I will be passing a list of manufacturers in dynamically. Any thoughts as to why the first construct is not working? Better, any thoughts on how to make it work? Thanks!
The problem is that == is alternating between the values of "honda" and "harley" and comparing with the value in the relevant position of your "manufacturer" variable. On the other hand, %in% (as suggested by MrFlick) and | are checking across the entire "manufacturer" variable before deciding which values to mark.
== will recycle values to the length of what is being compared.
This might be easier to see with an example:
set.seed(1)
v1 <- sample(letters[1:5], 10, TRUE)
v2 <- c("a", "b") ## Will be recycled to rep(c("a", "b"), 5) when comparing with v1
data.frame(v1, v2,
`==` = v1 == v2,
`%in%` = v1 %in% v2,
`|` = v1 == "a" | v1 == "b",
check.names = FALSE)
# v1 v2 == %in% |
# 1 b a FALSE TRUE TRUE
# 2 b b TRUE TRUE TRUE
# 3 c a FALSE FALSE FALSE
# 4 e b FALSE FALSE FALSE
# 5 b a FALSE TRUE TRUE
# 6 e b FALSE FALSE FALSE
# 7 e a FALSE FALSE FALSE
# 8 d b FALSE FALSE FALSE
# 9 d a FALSE FALSE FALSE
# 10 a b FALSE TRUE TRUE
Notice that in the == column, the only TRUE value was where "v1" and the recycled values of "v2" were the same.
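Applied to the original aggregate call, a dynamically supplied vector of manufacturers can be passed through %in%. The sketch below uses a small made-up leaddata (the real data is not shown in the question) and a hypothetical vector named wanted, and compares the totals from both constructs.

```r
# Made-up stand-in for the asker's data
leaddata <- data.frame(
  manufacturer = c("honda", "harley", "honda", "harley", "bmw", "honda"),
  pricegroup   = c("<5000", "<2500", "<5000", "<2500", "<5000", "<2500"),
  leads        = c(1, 2, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)

wanted <- c("honda", "harley")  # could be built dynamically

# %in% tests every row against the whole vector of wanted values
res_in <- aggregate(leads ~ manufacturer + pricegroup, data = leaddata,
                    FUN = sum, subset = manufacturer %in% wanted)

# == recycles wanted along the rows, so the last honda row (compared
# against "harley") is silently dropped
res_eq <- aggregate(leads ~ manufacturer + pricegroup, data = leaddata,
                    FUN = sum, subset = manufacturer == wanted)

sum(res_in$leads)
# [1] 16
sum(res_eq$leads)
# [1] 10
```

R gives no warning in the == case here because the number of rows is a multiple of length(wanted), which is why the undercount is easy to miss.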
