Identifying duplicate rows only with respect to some columns - r

I would like to create a variable (e.g. reap) taking the value TRUE only if the elements of some columns are duplicates of those of another row BUT the values on other columns are different.
The sample data will probably clarify my question:
V1 V2 V3
1. a b c
2. a b d
3. e f g
4. e f g
For example, if we want a variable taking value TRUE when rows have same V1 and V2 but different V3, then this variable should look like the following:
V1 V2 V3 reap
1. a b c TRUE
2. a b d TRUE
3. e f g FALSE
4. e f g FALSE
Thanks a lot for your help.

One idea would be to identify all duplicates within each column, then build a logical vector with rowSums using the condition != ncol(df):
rowSums(sapply(df, function(i) duplicated(i)|duplicated(i, fromLast = TRUE))) != ncol(df)
#[1] TRUE TRUE FALSE FALSE
To require that the third column specifically is the one that differs:
m1 <- sapply(df, function(i) duplicated(i)|duplicated(i, fromLast = TRUE))
rowSums(m1) == 2 & !m1[,3]
#[1] TRUE TRUE FALSE FALSE
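Note that the approach above checks each column independently, so a value duplicated anywhere in V1 counts even if V2 does not match on the same row. As a possible alternative, here is a minimal sketch that groups rows by the (V1, V2) pair and flags groups containing more than one distinct V3. The data frame is rebuilt from the question's sample, and the name n_distinct_v3 is only illustrative. It assumes that every row in such a group should be flagged.

```r
# Sample data from the question
df <- data.frame(
  V1 = c("a", "a", "e", "e"),
  V2 = c("b", "b", "f", "f"),
  V3 = c("c", "d", "g", "g"),
  stringsAsFactors = FALSE
)

# For every row, count the distinct V3 values within its (V1, V2) group.
# ave() returns a vector aligned with the rows of df.
n_distinct_v3 <- ave(
  seq_len(nrow(df)), df$V1, df$V2,
  FUN = function(idx) length(unique(df$V3[idx]))
)

# TRUE when another row shares V1 and V2 but has a different V3
df$reap <- n_distinct_v3 > 1
df$reap
#[1]  TRUE  TRUE FALSE FALSE
```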

Related

How do I use the across function across all columns of a dataframe rather than specify certain columns?

I have the following script so far that successfully creates a new column in my dataframe and populates it with a sum of how many times the value "TRUE" appears for each row of the dataframe:
data_1 <- data_1 %>% mutate(True_Count = rowSums(across(-c(Community.name), `%in%`, TRUE)))
You will notice after I bring in the across function, I specify that I want to drop a column from the function. However, I actually don't want to drop any columns from my function. I tried writing something like
across(data_1 %in%, TRUE) to indicate I want to go across the whole dataframe/all columns, but this is not the correct syntax.
Also, I tried to do this a much simpler way using just rowSums and no mutate as follows:
data_1$True_Count <- rowSums(df == TRUE) but all this did was create an empty column called True_Count and did not count the occurrences of TRUE logical values in each row. I also tried the same thing using a random string value that I know occurs exactly one time in my dataset: data_1$True_Count <- rowSums(df == "banana") but this did the same thing -- it created an empty column and did not count the instance of banana in my dataset.
Lastly there was one more behavior that I did not understand. If I run the first code, data_1 <- data_1 %>% mutate(True_Count = rowSums(across(-c(Community.name), `%in%`, TRUE))) more than once, the counts in the True_Count column cease to be correct.
It is really helpful if you share data in a reproducible format with the expected output so that everyone is on the same page regarding understanding of the question.
Since you did not share an example, I created one myself to explain the answer here. I have added 4 random columns with TRUE/FALSE values since it seems this is what your dataset contains.
data_1 <- data.frame(Community.name = c(T, F, T, F, F),
Community.code = c(T, F, F, T, T),
col1 = T,
col2 = c(F, F, T, F, F))
data_1
# Community.name Community.code col1 col2
#1 TRUE TRUE TRUE FALSE
#2 FALSE FALSE TRUE FALSE
#3 TRUE FALSE TRUE TRUE
#4 FALSE TRUE TRUE FALSE
#5 FALSE TRUE TRUE FALSE
Note that TRUE (logical) is different from "TRUE" (character). So first verify if your dataset contains logical values or character values before trying out the answers below.
This is your current code where you are dropping Community.name and calculating number of TRUE values in the dataset.
library(dplyr)
data_2 <- data_1 %>%
mutate(True_Count = rowSums(across(-c(Community.name), `%in%`, TRUE)))
data_2
# Community.name Community.code col1 col2 True_Count
#1 TRUE TRUE TRUE FALSE 2
#2 FALSE FALSE TRUE FALSE 1
#3 TRUE FALSE TRUE TRUE 2
#4 FALSE TRUE TRUE FALSE 2
#5 FALSE TRUE TRUE FALSE 2
This seems to work as expected. We ignore Community.name and count the TRUE values in the remaining columns of each row.
Now your question,
I actually don't want to drop any columns from my function.
For that you can use everything() in across to include all the columns.
data_3 <- data_1 %>%
mutate(True_Count = rowSums(across(everything(), `%in%`, TRUE)))
data_3
# Community.name Community.code col1 col2 True_Count
#1 TRUE TRUE TRUE FALSE 3
#2 FALSE FALSE TRUE FALSE 1
#3 TRUE FALSE TRUE TRUE 3
#4 FALSE TRUE TRUE FALSE 2
#5 FALSE TRUE TRUE FALSE 2
Also note that everything() is the default column selection in across (see ?across).
Also, I tried to do this a much simpler way using just rowSums and no mutate
Yes, using rowSums without mutate is a much simpler way to get the same answer.
data_1$True_Count <- rowSums(data_1)
data_1
# Community.name Community.code col1 col2 True_Count
#1 TRUE TRUE TRUE FALSE 3
#2 FALSE FALSE TRUE FALSE 1
#3 TRUE FALSE TRUE TRUE 3
#4 FALSE TRUE TRUE FALSE 2
#5 FALSE TRUE TRUE FALSE 2
Lastly there was one more behavior that I did not understand. If I run the first code, more than once, the counts in the True_Count column cease to be correct.
That might be because initially you don't have a True_Count column in the dataset. The first time you run the code, True_Count is added to data_1. When you run the code a second time, True_Count itself is included in the calculation, which is not what you want.
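One way to make the count safe to re-run is to sum only the logical columns, so that a previously added numeric True_Count is skipped. This is a minimal base R sketch under the assumption that the data columns really are logical; count_true is a hypothetical helper name, not something from the question.

```r
data_1 <- data.frame(Community.name = c(TRUE, FALSE, TRUE, FALSE, FALSE),
                     Community.code = c(TRUE, FALSE, FALSE, TRUE, TRUE),
                     col1 = TRUE,
                     col2 = c(FALSE, FALSE, TRUE, FALSE, FALSE))

# Hypothetical helper: count TRUE values per row using logical columns only
count_true <- function(d) {
  is_lgl <- vapply(d, is.logical, logical(1))  # True_Count is numeric, so it is excluded
  d$True_Count <- rowSums(d[is_lgl])
  d
}

once  <- count_true(data_1)
twice <- count_true(once)   # running it again leaves the counts unchanged
twice$True_Count
```

Because the helper selects columns by type rather than by position, the result is the same however many times it is applied.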

R: Test for overlap of name values in dataframe

I have a dataframe filled with names.
For a given row in the dataframe, I'd like to compare that row to every row above it in the df and determine if the number of matching names is less than or equal to 4 for every row.
Toy Example where row 3 is the row of interest
"Jim","Dwight","Michael","Andy","Stanley","Creed"
"Jim","Dwight","Angela","Pam","Ryan","Jan"
"Jim","Dwight","Angela","Pam","Creed","Ryan" <--- row of interest
So first we'd compare row 3 to row 1 and see that the name overlap is 3, which meets the <= 4 criteria.
Then we'd compare row 3 to row 2 and see that the name overlap is 5 which fails the <= 4 criteria, ultimately returning a failed condition for being <=4 for every row above it.
Right now I am doing this operation using a for loop but the speed is much too slow for the dataframe size I am working with.
Example data
df <- as.data.frame(rbind(
c("Jim","Dwight","Michael","Andy","Stanley","Creed"),
c("Jim","Dwight","Angela","Pam","Ryan","Jan"),
c("Jim","Dwight","Angela","Pam","Creed","Ryan")
), stringsAsFactors = FALSE)
df
# V1 V2 V3 V4 V5 V6
# 1 Jim Dwight Michael Andy Stanley Creed
# 2 Jim Dwight Angela Pam Ryan Jan
# 3 Jim Dwight Angela Pam Creed Ryan
Operation and output (sapply over columns with %in% and take rowSums)
out_lgl <- rowSums(sapply(df, '%in%', unlist(df[3,]))) <= 4
out_lgl
# [1] TRUE FALSE FALSE
which(out_lgl)
# [1] 1
Explanation:
For each column, each element is compared to the third row (the vector unlist(df[3,])). The output is a matrix of logical values with the same dimensions as df, TRUE if there is a match.
sapply(df, '%in%', unlist(df[3,]))
# V1 V2 V3 V4 V5 V6
# [1,] TRUE TRUE FALSE FALSE FALSE TRUE
# [2,] TRUE TRUE TRUE TRUE TRUE FALSE
# [3,] TRUE TRUE TRUE TRUE TRUE TRUE
Then we can sum the TRUEs to see the number of matches for each row
rowSums(sapply(df, '%in%', unlist(df[3,])))
# [1] 3 5 6
Edit:
I have added the stringsAsFactors = FALSE option to the creation of df above. However, as far as I can tell the output of %in% is the same whether comparing factors with different levels or characters, so I don't believe this could change the results in any way. See example below
x <- c('b', 'c', 'z')
y <- c('a', 'b', 'g')
all.equal(x %in% y, factor(x) %in% factor(y))
# [1] TRUE
A solution similar to IceCreamToucan's, but for any row.
For the data.frame:
df <- as.data.frame(rbind(
c("Jim","Dwight","Michael","Andy","Stanley","Creed"),
c("Jim","Dwight","Angela","Pam","Ryan","Jan"),
c("Jim","Dwight","Angela","Pam","Creed","Ryan")
), stringsAsFactors = FALSE)
For any row number i:
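f <- function(i) {
  if (i == 1) return(TRUE)  # the first row has no rows above it
  above <- as.matrix(df[seq_len(i - 1), , drop = FALSE])
  # %in% drops dimensions, so restore one row per earlier row before summing
  # (this also keeps the shape correct when i == 2 and there is a single row above)
  hits <- matrix(above %in% unlist(df[i, ]), nrow = i - 1)
  all(rowSums(hits) <= 4)  # every earlier row must share at most 4 names
}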
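To run the check for every row at once, something like the following could work. It is a self-contained sketch that uses a hypothetical helper overlap_ok (taking the data frame as an argument instead of relying on a global df) and is written for clarity, not tuned for speed on large data.

```r
df <- as.data.frame(rbind(
  c("Jim", "Dwight", "Michael", "Andy", "Stanley", "Creed"),
  c("Jim", "Dwight", "Angela", "Pam", "Ryan", "Jan"),
  c("Jim", "Dwight", "Angela", "Pam", "Creed", "Ryan")
), stringsAsFactors = FALSE)

# Hypothetical helper: TRUE if row i shares at most max_overlap names
# with each of the rows above it
overlap_ok <- function(df, i, max_overlap = 4) {
  if (i == 1) return(TRUE)           # nothing above the first row
  target <- unlist(df[i, ])
  overlaps <- vapply(
    seq_len(i - 1),
    function(j) sum(unlist(df[j, ]) %in% target),  # names shared with row j
    integer(1)
  )
  all(overlaps <= max_overlap)
}

result <- vapply(seq_len(nrow(df)), function(i) overlap_ok(df, i), logical(1))
result
# [1]  TRUE  TRUE FALSE
```

Row 3 fails because it shares 5 names with row 2, as described in the question.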

R: Appending values from row values specified by other row values to a list

I have a data frame with two columns - a group number and a name:
Group Name
1 A
4 B
2 C
3 D
4 E
I now want to make a list containing all the names that have groups in common.
I have tried with this for loop:
myfun <- function(x,g1,g2,g3,g4){
for (j in 1:nrow(x)){
if (x[1,j] == 1){
list(g1, list(c=x[2,j]))
} else if (x[1,j] == 2){
list(g2, list(c=x[2,j]))
} else if (x[1,j] == 3){
list(g3, list(c=x[2,j]))
} else if (x[1,j] == 4){
list(g4, list(c=x[2,j]))
}
}
}
where g1, g2, g3 and g4 are empty lists.
I get this error Error in if (x[1, i] == 1) { : argument is of length zero.
Do I have the right approach?
Edit:
How can I search the list and extract the group that contains a given value (let's say I want the group with the name B in it)?
You can simplify your code (avoiding all the loops) by using an apply function (dat is the data)
res <- lapply(unique(dat$Group), function(g) unique(dat[dat$Group==g, "Name"]))
names(res) <- unique(dat$Group)
res[["4"]]
# [1] B E
# Levels: A B C D E
This creates a list where the indices of the list correspond to unique(dat$Group) and each element contains the unique "Name"s in that group.
Another solution, using plyr
library(plyr)
res <- dlply(dat, .(Group), function(x) unique(x$Name))
res[["4"]]
# [1] B E
# Levels: A B C D E
## If you want to extract all the groups with a "B" Name
inds <- unlist(lapply(res, function(x) "B" %in% x))
inds
# 1 2 3 4
# FALSE FALSE FALSE TRUE
## and to extract that Group
names(inds)[inds]
# [1] "4"
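The same grouping can also be sketched in base R with split, which needs no extra package. The data frame below is rebuilt from the question's table; storing Name as character (stringsAsFactors = FALSE) is an assumption made here so the output has no factor levels.

```r
dat <- data.frame(Group = c(1, 4, 2, 3, 4),
                  Name  = c("A", "B", "C", "D", "E"),
                  stringsAsFactors = FALSE)

# One list element per group, named by the group number
res <- split(dat$Name, dat$Group)
res[["4"]]
# [1] "B" "E"

# Which group(s) contain the name "B"?
has_b <- vapply(res, function(x) "B" %in% x, logical(1))
groups_with_b <- names(res)[has_b]
groups_with_b
# [1] "4"
```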

Unexpected behavior in subsetting aggregate function in R

I have a data frame with the following format:
manufacturers pricegroup leads
harley <2500 #
honda <5000 #
... ... ..
I am using the aggregate function to pull out data in the following way:
aggregate( leads ~ manufacturer + pricegroup, data=leaddata,
FUN=sum, subset=(manufacturer==c("honda","harley")))
I noticed this is not returning the correct totals. The numbers for each manufacturer get smaller and smaller the more manufacturers I add to the subset group. However, if I use:
aggregate( leads ~ manufacturer + pricegroup, data=leaddata,
FUN=sum, subset=(manufacturer=="honda" | manufacturer=="harley"))
It returns the correct numbers. For the life of me, I can't figure out why. I would just use the OR operator, except I will be passing a list of manufacturers in dynamically. Any thoughts as to why the first construct is not working? Better, any thoughts on how to make it work? Thanks!
The problem is that == is alternating between the values of "honda" and "harley" and comparing with the value in the relevant position of your "manufacturer" variable. On the other hand, %in% (as suggested by MrFlick) and | are checking across the entire "manufacturer" variable before deciding which values to mark.
== will recycle values to the length of what is being compared.
This might be easier to see with an example:
set.seed(1)
v1 <- sample(letters[1:5], 10, TRUE)
v2 <- c("a", "b") ## Will be recycled to rep(c("a", "b"), 5) when comparing with v1
data.frame(v1, v2,
`==` = v1 == v2,
`%in%` = v1 %in% v2,
`|` = v1 == "a" | v1 == "b",
check.names = FALSE)
# v1 v2 == %in% |
# 1 b a FALSE TRUE TRUE
# 2 b b TRUE TRUE TRUE
# 3 c a FALSE FALSE FALSE
# 4 e b FALSE FALSE FALSE
# 5 b a FALSE TRUE TRUE
# 6 e b FALSE FALSE FALSE
# 7 e a FALSE FALSE FALSE
# 8 d b FALSE FALSE FALSE
# 9 d a FALSE FALSE FALSE
# 10 a b FALSE TRUE TRUE
Notice that in the == column, the only TRUE value was where "v1" and the recycled values of "v2" were the same.
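Applied to the original problem, a dynamically built vector of manufacturers can be passed through %in% in the subset argument. The sketch below uses made-up data; leaddata's contents and the name wanted are assumptions for illustration only.

```r
# Made-up data in the shape described in the question
leaddata <- data.frame(
  manufacturer = c("honda", "harley", "honda", "harley", "bmw", "honda"),
  pricegroup   = c("<2500", "<2500", "<5000", "<5000", "<2500", "<2500"),
  leads        = c(1, 2, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)

wanted <- c("honda", "harley")   # could be built at run time

# %in% tests every row against the whole vector, so no rows are lost to recycling
totals <- aggregate(leads ~ manufacturer + pricegroup, data = leaddata,
                    FUN = sum, subset = manufacturer %in% wanted)
totals

# For comparison: == recycles `wanted` along the rows and misses the last honda row
sum(leaddata$leads[leaddata$manufacturer == wanted])    # 10
sum(leaddata$leads[leaddata$manufacturer %in% wanted])  # 16
```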

Can't match character value to multiple columns

I'm trying to match a character value, "C", to multiple columns in a dataframe. Here's part of the frame:
X1 X2
1 F F
2 C C
3 D D
4 A# A#
Here's what happens when I try to match the value "C":
> "C" %in% frame[, 1]
[1] TRUE
> "C" %in% frame[, 1:2]
[1] FALSE
Considering that "C" is in both columns, I can't figure out why it's returning false. Is there a function or operator that can test to see if a value is present in multiple columns? My goal is to create a function that can sum the number of times a character value like "C" is found in specified columns.
Try:
apply(frame, 2, function(u) "C" %in% u)
You can also use is.element:
apply(frame, 2, function(u) is.element("C",u))
You probably want to use grepl here, which returns a logical vector. Then you can count the number of occurrences with sum.
> frame
X1 X2
1 F F
2 C C
3 D D
4 A# A#
> grepl('C', frame$X1)
[1] FALSE TRUE FALSE FALSE
> sum(grepl('C', frame$X1))
[1] 1
and to count the total number of Cs in every column you can use lapply (note: apply is better suited for matrices, not data frames, which are lists):
> sum(unlist(lapply(frame, function(col) grepl('C', col))))
[1] 2
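Keep in mind that grepl matches substrings, so a pattern like 'C' would also match a value such as "C#". If exact matches are what you want, a comparison with == over the chosen columns may be closer to the stated goal. The following is a minimal sketch; count_value is a hypothetical helper name and the data frame is rebuilt from the part of the frame shown in the question.

```r
frame <- data.frame(X1 = c("F", "C", "D", "A#"),
                    X2 = c("F", "C", "D", "A#"),
                    stringsAsFactors = FALSE)

# Hypothetical helper: count exact occurrences of `value` in the given columns
count_value <- function(d, cols, value) {
  # d[cols] stays a data frame, so == returns a logical matrix
  sum(d[cols] == value, na.rm = TRUE)
}

count_value(frame, c("X1", "X2"), "C")
# [1] 2
count_value(frame, "X1", "A#")
# [1] 1
```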
