Identifying Duplicates in `data.frame` Using `dplyr` - r

I want to identify (not eliminate) duplicates in a data frame and add a 0/1 variable accordingly (whether a row is a duplicate or not), using the R dplyr package.
Example:
  | A B C D
1 | 1 0 1 1
2 | 1 0 1 1
3 | 0 1 1 1
4 | 0 1 1 1
5 | 1 1 1 1
Clearly, rows 1 and 2 are duplicates, so I want to create a new variable (with mutate?), say E, that is equal to 1 in rows 1, 2, 3 and 4, since rows 3 and 4 are also identical.
Moreover, I want to add another variable, F, that is equal to 1 if there is a duplicate differing only by one column. That is, F in rows 1, 2 and 5 would be equal to 1 since they differ only in the B column.
I hope it is clear what I want to do and I hope that dplyr offers a smooth solution to this problem. This is of course possible in "base" R but I believe (hope) that there exists a smoother solution.

You can use dist() to compute the pairwise differences, and then a search in the resulting distance matrix gives the needed answers (E, F, etc.). Here is example code, where X is the original data.frame:
W <- as.matrix(dist(X, method = "manhattan"))
# E: the row has an exact duplicate (distance 0 to some other row)
X$E <- as.integer(sapply(1:nrow(W), function(i, D) any(W[-i, i] == D), D = 0))
# F: the row differs from some other row in exactly one column (distance 1)
X$F <- as.integer(sapply(1:nrow(W), function(i, D) any(W[-i, i] == D), D = 1))
Just change D= to the number of differing columns you need.
It's all base R though. Using plyr::laply instead of sapply has the same effect; dplyr looks like overkill here.
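For reference, here is a minimal run on the question's data (a sketch: X is rebuilt from the example above, and the distance-1 trick assumes the columns are 0/1 valued):
X <- data.frame(A = c(1, 1, 0, 0, 1),
                B = c(0, 0, 1, 1, 1),
                C = c(1, 1, 1, 1, 1),
                D = c(1, 1, 1, 1, 1))
W <- as.matrix(dist(X, method = "manhattan"))
X$E <- as.integer(sapply(1:nrow(W), function(i, D) any(W[-i, i] == D), D = 0))
X$F <- as.integer(sapply(1:nrow(W), function(i, D) any(W[-i, i] == D), D = 1))
X
#   A B C D E F
# 1 1 0 1 1 1 1
# 2 1 0 1 1 1 1
# 3 0 1 1 1 1 1
# 4 0 1 1 1 1 1
# 5 1 1 1 1 0 1
Note that F is 1 everywhere here, which is exactly the behaviour the next answer points out before tweaking the data.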

Here is a data.table solution that is extendable to an arbitrary case (1..n columns the same) - not sure if someone can convert it to dplyr for you. I had to change your dataset a bit to show your desired F column - in your example all rows would get a 1, because rows 3 and 4 are one column different from row 5 as well.
library(data.table)
DT <- data.frame(A = c(1,1,0,0,1), B = c(0,0,1,1,1), C = c(1,1,1,1,1), D = c(1,1,1,1,1), E = c(1,1,0,0,0))
DT
A B C D E
1 1 0 1 1 1
2 1 0 1 1 1
3 0 1 1 1 0
4 0 1 1 1 0
5 1 1 1 1 0
setDT(DT)
DT_ncols <- length(DT)
# all pairs of row indices
base <- data.table(t(combn(1:nrow(DT), 2)))
setnames(base, c("V1", "V2"), c("ind_x", "ind_y"))
DT[, ind := .I]
DT_melt <- melt(DT, id.vars = "ind", variable.name = "column")
# join each pair to both rows' values, then count matching columns per pair
base <- merge(base, DT_melt, by.x = "ind_x", by.y = "ind", allow.cartesian = TRUE)
base <- merge(base, DT_melt, by.x = c("ind_y", "column"), by.y = c("ind", "column"))
base <- base[, .(common_cols = sum(value.x == value.y)), by = .(ind_x, ind_y)]
This gives us a data.table that looks like this:
base
ind_x ind_y common_cols
1: 1 2 5
2: 1 3 2
3: 2 3 2
4: 1 4 2
5: 2 4 2
6: 3 4 5
7: 1 5 3
8: 2 5 3
9: 3 5 4
10: 4 5 4
This says that rows 1 and 2 have 5 common columns (exact duplicates), as do rows 3 and 4. Rows 3 and 5 have 4 common columns, and rows 4 and 5 have 4 common columns. We can now use a fairly extendable format to flag any combination we want:
base <- melt(base, id.vars = "common_cols")
# Exact duplicates - common_cols == DT_ncols
DT[, F := ifelse(ind %in% unique(base[common_cols == DT_ncols, value]), 1, 0)]
# Same save 1 - common_cols == DT_ncols - 1
DT[, G := ifelse(ind %in% unique(base[common_cols == DT_ncols - 1, value]), 1, 0)]
# Same save 2 - common_cols == DT_ncols - 2
DT[, H := ifelse(ind %in% unique(base[common_cols == DT_ncols - 2, value]), 1, 0)]
This gives:
A B C D E ind F G H
1: 1 0 1 1 1 1 1 0 1
2: 1 0 1 1 1 2 1 0 1
3: 0 1 1 1 0 3 1 1 0
4: 0 1 1 1 0 4 1 1 0
5: 1 1 1 1 0 5 0 1 1
Instead of manually selecting, you can append all combinations like so:
# run after base <- melt(base, id.vars = "common_cols")
base <- unique(base[,.(ind = value, common_cols)])
base[, common_cols := factor(common_cols, 1:DT_ncols)]
merge(DT, dcast(base, ind ~ common_cols, fun.aggregate = length, drop = FALSE), by = "ind")
ind A B C D E 1 2 3 4 5
1: 1 1 0 1 1 1 0 1 1 0 1
2: 2 1 0 1 1 1 0 1 1 0 1
3: 3 0 1 1 1 0 0 1 0 1 1
4: 4 0 1 1 1 0 0 1 0 1 1
5: 5 1 1 1 1 0 0 0 1 1 0
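Since the answer leaves the dplyr conversion open, here is one possible tidyverse sketch of the same pairwise idea (assumes tidyr >= 1.0 for pivot_longer and dplyr >= 1.0 for .groups; the self-join is many-to-many by design, so newer dplyr may warn):
library(dplyr)
library(tidyr)
dat <- data.frame(A = c(1,1,0,0,1), B = c(0,0,1,1,1),
                  C = c(1,1,1,1,1), D = c(1,1,1,1,1),
                  E = c(1,1,0,0,0))
n_cols <- ncol(dat)
# one row per (row index, column) pair
long <- dat %>%
  mutate(ind = row_number()) %>%
  pivot_longer(-ind, names_to = "column")
# self-join on column, keep each unordered pair once, count matching values
pairs <- long %>%
  inner_join(long, by = "column") %>%
  filter(ind.x < ind.y) %>%
  group_by(ind.x, ind.y) %>%
  summarise(common_cols = sum(value.x == value.y), .groups = "drop")
dup_ids  <- with(filter(pairs, common_cols == n_cols),     union(ind.x, ind.y))
near_ids <- with(filter(pairs, common_cols == n_cols - 1), union(ind.x, ind.y))
dat %>% mutate(F = +(row_number() %in% dup_ids),
               G = +(row_number() %in% near_ids))
This reproduces the F and G columns shown above.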

Here is a dplyr solution. Note that it only compares each row with its immediate neighbour via lag()/lead(), so it assumes that duplicate rows are adjacent:
test %>%
  mutate(flag = (A == lag(A) &
                 B == lag(B) &
                 C == lag(C) &
                 D == lag(D))) %>%
  mutate(twice = lead(flag) == TRUE) %>%
  mutate(E = ifelse(flag == TRUE | twice == TRUE, 1, 0)) %>%
  mutate(E = ifelse(is.na(E), 0, E)) %>%  # lag/lead leave NA at the edges
  # with 0/1 data, the four pairwise sums total 7 exactly when the two rows
  # agree (both 1) in three columns and differ in the fourth
  mutate(FF = ifelse((A + lag(A)) + (B + lag(B)) + (C + lag(C)) + (D + lag(D)) == 7, 1, 0)) %>%
  mutate(FF = ifelse(is.na(FF) | FF == 0, 0, 1)) %>%
  select(A, B, C, D, E, FF)
Result:
A B C D E FF
1 1 0 1 1 1 0
2 1 0 1 1 1 0
3 0 1 1 1 1 0
4 0 1 1 1 1 0
5 1 1 1 1 0 1
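If the duplicates are not guaranteed to be adjacent, a more robust dplyr sketch is to count rows per combination of all four columns (the name argument of add_count assumes dplyr >= 0.8):
library(dplyr)
test %>%
  add_count(A, B, C, D, name = "n_same") %>%  # rows sharing this exact (A,B,C,D)
  mutate(E = +(n_same > 1)) %>%
  select(-n_same)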

Related

nested for loop in R, where the second index counts inside the first one

I have for example a dataset like this:
data <- data.frame(matrix(c(1,2,2,3,4,5,5,"a","a","b","a","a","a","b"), nrow = 7, ncol = 2, byrow = F))
X1 X2
1 a
2 a
2 b
3 a
4 a
5 a
5 b
Then I add another variable, "tag", initially set to 0:
data$tag <- 0
X1 X2 tag
1 a 0
2 a 0
2 b 0
3 a 0
4 a 0
5 a 0
5 b 0
I'd like to have "tag" equal to 1 for each row whose X1 value is repeated, like:
X1 X2 tag
1 a 0
2 a 1
2 b 1
3 a 0
4 a 0
5 a 1
5 b 1
I used the following code:
for (i in data$X1) {
  for (j in 1:length(data$X1)) {
    if (j == 2) {data$tag[j] <- 1}
  }
}
but it doesn't work the way I would like. I'd like the second loop (j) to run inside the first one in order to obtain what I want, with j starting from 1 every time X1 changes.
How can I manage it?
Thanks a lot
Maybe you can try ave; ave(X1, X1, FUN = length) gives, for each row, the number of rows sharing its X1 value:
within(
  data,
  tag <- +(ave(X1, X1, FUN = length) > 1)
)
which gives
X1 X2 tag
1 1 a 0
2 2 a 1
3 2 b 1
4 3 a 0
5 4 a 0
6 5 a 1
7 5 b 1
You can use duplicated from both ends in base R:
data$tag <- as.integer(duplicated(data$X1) |
                         duplicated(data$X1, fromLast = TRUE))
data
# X1 X2 tag
#1 1 a 0
#2 2 a 1
#3 2 b 1
#4 3 a 0
#5 4 a 0
#6 5 a 1
#7 5 b 1
An option with add_count:
library(dplyr)
data %>%
  add_count(X1) %>%
  mutate(n = +(n > 1))
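To get the desired tag column directly, the count can be named up front (a sketch; the name argument of add_count assumes dplyr >= 0.8):
library(dplyr)
data %>%
  add_count(X1, name = "tag") %>%
  mutate(tag = +(tag > 1))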

R: Define starting condition for continuous value

I'm trying to set up two new variables to incorporate into an existing data.frame, each a running value starting at 1 (0) once a condition is met, with respect to the IDs in the data.frame. The data.frame is of similar structure to this:
ID Var1
1 0
1 2
1 5
1 12
2 0
2 2
2 NA
2 11
and I want to get to:
ID Var1 start stop
1 0 0 0
1 2 0 1
1 5 1 2
1 12 2 3
2 0 0 0
2 2 0 1
2 NA 1 2
2 11 2 3
Start should be a running value, starting once Var1 > 0 for the first time, and stop should operate the same way. Start's starting value should be 0 and stop's starting value should be 1. Both should continue running even if Var1 takes on NA or 0 again later in the data.frame. I have tried the following:
df %>%
  group_by(ID) %>%
  mutate(stop = ifelse(Var1 > 0,
                       0:nrow(df), 0))
But the variable it returns doesn't start at 0; it starts with the number of the row in which the condition is first met.
Sorry, I don't speak dplyr but you can easily adapt this, since data.table is only used for group-by.
DF <- read.table(text = "ID Var1
1 0
1 2
1 5
1 12
2 0
2 2
2 NA
2 11", header = TRUE)
foo <- function(x) {
  # quantify leading zeros:
  x[is.na(x)] <- 0
  lead0 <- cumsum(x > 0)
  nlead0 <- sum(lead0 == 0)
  # create result using sequence:
  list(c(rep.int(0, nlead0), sequence(length(x) - nlead0) - 1),
       c(rep.int(0, nlead0), sequence(length(x) - nlead0)))
}
library(data.table)
setDT(DF)
DF[, c("start", "stop") := foo(Var1), by = ID]
# ID Var1 start stop
#1: 1 0 0 0
#2: 1 2 0 1
#3: 1 5 1 2
#4: 1 12 2 3
#5: 2 0 0 0
#6: 2 2 0 1
#7: 2 NA 1 2
#8: 2 11 2 3
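Since the answer invites a dplyr adaptation, here is one possible translation of foo() as a sketch (coalesce() plays the role of the NA replacement; checked only against the sample data above):
library(dplyr)
DF %>%
  group_by(ID) %>%
  mutate(stop  = cumsum(cumsum(coalesce(Var1, 0) > 0) > 0),
         start = pmax(stop - 1, 0)) %>%
  ungroup()
The inner cumsum marks everything from the first Var1 > 0 onwards; the outer one turns that into the running counter.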
Here is a base R option using ave + replace:
transform(df,
          Start = ave(ave(replace(Var1, is.na(Var1), 0) > 0, ID, FUN = cumsum) > 0,
                      ID, FUN = function(x) cumsum(c(0, x))[-(length(x) + 1)]),
          Stop  = ave(ave(replace(Var1, is.na(Var1), 0) > 0, ID, FUN = cumsum) > 0,
                      ID, FUN = cumsum)
)
or
transform(df,
          Start = ave(ave(ave(replace(Var1, is.na(Var1), 0) > 0, ID, FUN = cumsum),
                          ID, FUN = cumsum) > 1, ID, FUN = cumsum),
          Stop  = ave(ave(replace(Var1, is.na(Var1), 0) > 0, ID, FUN = cumsum) > 0,
                      ID, FUN = cumsum)
)
which gives
ID Var1 Start Stop
1 1 0 0 0
2 1 2 0 1
3 1 5 1 2
4 1 12 2 3
5 2 0 0 0
6 2 2 0 1
7 2 NA 1 2
8 2 11 2 3

Check condition row wise for a number of columns [duplicate]

This question already has an answer here: How to subset all rows in a dataframe that have a particular value
Data example:
df <- data.frame("a" = c(1,2,3,4), "b" = c(4,3,2,1), "x_ind" = c(1,0,1,1), "y_ind" = c(0,0,1,1), "z_ind" = c(0,1,1,1) )
> df
a b x_ind y_ind z_ind
1 1 4 1 0 0
2 2 3 0 0 1
3 3 2 1 1 1
4 4 1 1 1 1
I want to add a new column which checks whether, for the columns ending in "_ind", the whole row has all values equal to 1; if it does, it returns 1, otherwise 0. So the result dataframe looks like:
a b x_ind y_ind z_ind keep
1 1 4 1 0 0 0
2 2 3 0 0 1 0
3 3 2 1 1 1 1
4 4 1 1 1 1 1
I can select the columns by using df %>% select(contains("_ind")), however I am not sure how to do a rowwise operation which checks that every value in the row equals 1, and then append the column back to the original dataframe.
Any help would be appreciated! I'm working with dplyr but appreciate any solution.
You can compare your df to 1 and use rowSums, i.e.
rowSums(df[grepl('_ind', names(df))] == 1) == ncol(df[grepl('_ind', names(df))])
#[1] FALSE FALSE TRUE TRUE
Continuing your dplyr attempt you can do,
df %>%
  select(contains("_ind")) %>%
  mutate(new = rowSums(. == 1) == ncol(.))
# x_ind y_ind z_ind new
#1 1 0 0 FALSE
#2 0 0 1 FALSE
#3 1 1 1 TRUE
#4 1 1 1 TRUE
#OR you can filter directly
df %>%
  select(contains("_ind")) %>%
  filter(rowSums(. == 1) == ncol(.))
# x_ind y_ind z_ind
#1 1 1 1
#2 1 1 1
If you want to also keep the original columns, you can use
df %>%
  filter_at(vars(ends_with('_ind')), all_vars(. == 1))
# a b x_ind y_ind z_ind
#1 3 2 1 1 1
#2 4 1 1 1 1
NOTE: When we use (.), the dot refers to the resulting data frame. In this case, it refers to the columns specified in the condition (i.e. the columns that end with _ind).
Similarly in base R,
df[rowSums(df[grepl('_ind', names(df))] == 1) == ncol(df[grepl('_ind', names(df))]),]
# a b x_ind y_ind z_ind
#3 3 2 1 1 1
#4 4 1 1 1 1
You can use rowwise with c_across in recent dplyr (c_across was introduced in dplyr 1.0.0):
library(dplyr)
df %>% rowwise() %>% mutate(keep = +all(c_across(ends_with('ind')) == 1))
# a b x_ind y_ind z_ind keep
# <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#1 1 4 1 0 0 0
#2 2 3 0 0 1 0
#3 3 2 1 1 1 1
#4 4 1 1 1 1 1
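On newer dplyr still, the same check works without rowwise() (a sketch; if_all() assumes dplyr >= 1.0.4):
library(dplyr)
df %>% mutate(keep = +if_all(ends_with("_ind"), ~ .x == 1))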
You can use apply with all, using endsWith to get the columns ending with _ind and testing if they are == 1.
df$keep <- +(apply(df[,endsWith(colnames(df), "_ind")]==1, 1, all))
df
# a b x_ind y_ind z_ind keep
#1 1 4 1 0 0 0
#2 2 3 0 0 1 0
#3 3 2 1 1 1 1
#4 4 1 1 1 1 1
or using rowSums
df$keep <- +(rowSums(df[,endsWith(colnames(df), "_ind")]!=1) == 0)

Calculate the number of times a streak of categories changes in a row in R

I have the following data frame in R:
Row number   A  B C D E F G H I J
1           NA  1 1 0 0 1 0 0 1 1
2           NA NA 1 0 0 0 1 0 0 1
3           NA  1 0 0 0 1 0 0 1 1
I am trying to calculate, for each row, the number of times the value changes between 1 and 0, excluding the NAs.
The result I am expecting is this
Row Number No of changes
---------- --------------
1 4
2 4
3 4
An explanation for row 1:
In row 1, A has an NA, so we exclude it.
B and C have 1, which is our first set of values.
D and E have 0, which is our second set of values. Now Change = 1.
F has 1, which is our third set of values. Now Change = 2.
G and H have 0, which is our fourth set of values. Now Change = 3.
I and J have 1, which is our fifth set of values. Now Change = 4.
Here's a tidyverse approach.
I pivot into longer format (with tidyr::pivot_longer), then add a helper column noting each change from 0 to 1 or from 1 to 0, and then sum those changes by row.
library(tidyverse)
df %>%
  # before tidyr 1.0, this would be gather(col, value, -1)
  pivot_longer(-1, "col") %>%
  group_by(Row.number) %>%
  mutate(chg = value == 1 & lag(value) == 0 |
               value == 0 & lag(value) == 1) %>%
  summarize(no_chgs = sum(chg, na.rm = TRUE))
# A tibble: 3 x 2
Row.number no_chgs
<int> <int>
1 1 4
2 2 4
3 3 4
Sample data:
df <- read.table(
  header = TRUE,
  stringsAsFactors = FALSE,
  text = "'Row number' A B C D E F G H I J
1 NA  1 1 0 0 1 0 0 1 1
2 NA NA 1 0 0 0 1 0 0 1
3 NA  1 0 0 0 1 0 0 1 1")
Here's a data.table solution:
library(data.table)
dt <- as.data.table(df)
dt[,
   no_change := max(rleid(na.omit(t(.SD)))) - 1,
   by = Row.number
]
dt
Alternatively, here's a base version:
apply(df[, -1],
      1,
      function(x) {
        complete_case <- complete.cases(x)
        if (sum(complete_case) > 0) {
          return(length(rle(x[complete_case])$lengths) - 1)
        } else {
          return(0)
        }
      })

Grouping and Counting instances?

Is it possible to group and count instances of all other columns using R (dplyr)? For example, the following dataframe
x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1
Turns into this (note: y is the value being counted):
EDIT - explaining the transformation: x is what I'm grouping by. For each x group I want to count how many times each value (0, 1 and 2) appears in the other columns. In the first row of the transformed dataframe, we count how often the value y = 0 occurs within group x = 1: once in column a, twice in column b, and once in column c.
x y a b c
1 0 1 2 1
1 1 1 0 2
1 2 1 1 0
2 1 1 0 1
2 2 0 1 0
An approach with a combination of the melt and dcast functions of data.table or reshape2:
library(data.table) # v1.9.5+
dt.new <- dcast(melt(setDT(df), id.vars="x"), x + value ~ variable)
this gives:
dt.new
# x value a b c
# 1: 1 0 1 2 1
# 2: 1 1 1 0 2
# 3: 1 2 1 1 0
# 4: 2 1 1 0 1
# 5: 2 2 0 1 0
In dcast you can specify which aggregation function to use, but that is not necessary here, as the default aggregation function is length. Without specifying an aggregation function explicitly, you will get a warning about that:
Aggregation function missing: defaulting to length
Furthermore, if you do not explicitly convert the dataframe to a data.table, data.table will redirect to reshape2 (see the explanation from @Arun in the comments). Consequently, this method can be used with reshape2 as well:
library(reshape2)
df.new <- dcast(melt(df, id.vars="x"), x + value ~ variable)
Used data:
df <- read.table(text="x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1", header=TRUE)
I'd use a combination of gather and spread from the tidyr package, and count from dplyr:
library(dplyr)
library(tidyr)
df <- data.frame(x = c(1,1,1,2), a = c(0,1,2,1), b = c(0,0,2,2), c = c(0,1,1,1))
res <- df %>%
  gather(variable, value, -x) %>%
  count(x, variable, value) %>%
  spread(variable, n, fill = 0)
# Source: local data frame [5 x 5]
#
# x value a b c
# 1 1 0 1 2 1
# 2 1 1 1 0 2
# 3 1 2 1 1 0
# 4 2 1 1 0 1
# 5 2 2 0 1 0
Essentially, you first change the format of the dataset to:
head(df %>%
       gather(variable, value, -x))
# x variable value
#1 1 a 0
#2 1 a 1
#3 1 a 2
#4 2 a 1
#5 1 b 0
#6 1 b 0
Which allows you to use count to get the information on how often certain values occur in columns a to c. After that, you reshape the dataset into the required format using spread.
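gather() and spread() are superseded; on tidyr >= 1.0 the same pipeline can be written with their replacements, as a sketch (passing a bare 0 to values_fill assumes tidyr >= 1.1):
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(-x, names_to = "variable") %>%
  count(x, variable, value) %>%
  pivot_wider(names_from = variable, values_from = n, values_fill = 0)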
