Find groups of duplicates in data frame by all columns except one - r

I have a large dataframe. For some purposes I need to do the following:
Select one column in this data frame.
Iterate over all rows of the data frame, ignoring the selected column.
Find all rows that are equal element-wise except in the selected column.
Group them so that each group's name is a row index and its values are the indices of the rows that duplicate it.
I have written a function for this task, but it runs slowly because of the nested loop. I would like some ideas on how this code can be improved.
Say we have a dataframe like this:
V1 V2 V3 V4
1 1 2 1 2
2 1 2 2 1
3 1 1 1 2
4 1 1 2 1
5 2 2 1 2
And we want to get this list as output:
diff.dataframe("V2", conf.new, conf.new)
Output:
$`1`
[1] 1
$`2`
[1] 2
$`3`
[1] 1 3
$`4`
[1] 2 4
$`5`
[1] 5
The following code reaches the goal, but it is too slow. Is it possible to improve it somehow?
diff.dataframe <- function(param, df1, df2){
  excl.names <- c(param)
  df1.excl <- data.frame(lapply(df1[, !names(df1) %in% excl.names], as.character),
                         stringsAsFactors = FALSE)
  df2.excl <- data.frame(lapply(df2[, !names(df2) %in% excl.names], as.character),
                         stringsAsFactors = FALSE)
  list.out <- list()
  for (i in 1:nrow(df1.excl)){
    for (j in 1:nrow(df2.excl)){
      if (paste(df1.excl[i, ], collapse = '') == paste(df2.excl[j, ], collapse = '')){
        if (!as.character(i) %in% unlist(list.out)){
          list.out[[as.character(i)]] <- c(list.out[[as.character(i)]], j)
        }
      }
    }
  }
  return(list.out)
}

Let's generate some data first:
df <- as.data.frame(matrix(sample(2, 20, TRUE), 5))
# Produces df like this
V1 V2 V3 V4
1 2 1 1 1
2 2 1 2 2
3 1 1 2 2
4 1 2 1 1
5 1 2 1 1
We then loop over the row indices with lapply. Each row i is compared to all rows of df with apply (including itself). Rows with <= 1 differences give TRUE, the others FALSE, producing a logical vector, which we convert to row indices with which.
lapply(1:nrow(df), function(i)
  which(apply(df, 1, function(x) sum(x != df[i, ]) <= 1)))
# Produces output like this
[[1]]
[1] 1
[[2]]
[1] 2 3
[[3]]
[1] 2 3
[[4]]
[1] 4 5
[[5]]
[1] 4 5
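If you need strict equality on every column except the selected one (as the question asks), you can avoid pairwise comparisons altogether by building a key from the remaining columns and splitting the row indices on it. A minimal sketch along those lines (the function name diff.dataframe.fast and the use of a single data frame are my own; note it enforces exact equality outside the excluded column, so its groups can differ from the <= 1-difference output above):
diff.dataframe.fast <- function(param, df) {
  # one key string per row, built from all columns except `param`
  keys <- do.call(paste, c(df[setdiff(names(df), param)], sep = "\r"))
  # rows sharing a key are duplicates of each other
  groups <- split(seq_len(nrow(df)), keys)
  # look up each row's group; name the result by its row index
  setNames(groups[keys], seq_len(nrow(df)))
}
diff.dataframe.fast("V2", df)
This builds the keys in one pass and does a single split, so it scales roughly linearly with the number of rows instead of quadratically.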

Related

Count instances of overlap in two vectors in R

I am hoping to create a matrix that shows a count of instances of overlapping values for a grouping variable based on a second variable. Specifically, I am hoping to determine the degree to which primary studies overlap across meta-analyses in order to create a network diagram.
So, in this example, I have three meta-analyses that include some portion of three primary studies.
df <- data.frame(metas = c(1,1,1,2,3,3), studies = c(1,3,2,1,2,3))
metas studies
1 1 1
2 1 3
3 1 2
4 2 1
5 3 2
6 3 3
I would like it to return:
v1 v2 v3
1 3 1 2
2 1 1 0
3 2 0 2
The value in row 1, column 1 indicates that Meta-analysis 1 had three studies in common with itself (i.e., it included three studies). Row 1, column 2 indicates that Meta-analysis 1 had one study in common with Meta-analysis 2. Row 1, column 3 indicates that Meta-analysis 1 had two studies in common with Meta-analysis 3.
I believe you are looking for a symmetric matrix of intersecting studies.
dfspl <- split(df$studies, df$metas)
out <- outer(seq_along(dfspl), seq_along(dfspl),
             function(a, b) lengths(Map(intersect, dfspl[a], dfspl[b])))
out
# [,1] [,2] [,3]
# [1,] 3 1 2
# [2,] 1 1 0
# [3,] 2 0 2
If you need names on them, you can go with the names as defined by df$metas:
rownames(out) <- colnames(out) <- names(dfspl)
out
# 1 2 3
# 1 3 1 2
# 2 1 1 0
# 3 2 0 2
If you need the names defined as v plus the meta name, go with
rownames(out) <- colnames(out) <- paste0("v", names(dfspl))
out
# v1 v2 v3
# v1 3 1 2
# v2 1 1 0
# v3 2 0 2
If you need to understand what this is doing, outer creates an expansion of the two argument vectors, and passes them all at once to the function. For instance,
outer(seq_along(dfspl), seq_along(dfspl), function(a, b) { browser(); 1; })
# Called from: FUN(X, Y, ...)
# debug at #1: [1] 1
# Browse[2]>
a
# [1] 1 2 3 1 2 3 1 2 3
# Browse[2]>
b
# [1] 1 1 1 2 2 2 3 3 3
# Browse[2]>
What we ultimately want to do is find the intersection of each pair of studies.
dfspl[[1]]
# [1] 1 3 2
dfspl[[3]]
# [1] 2 3
intersect(dfspl[[1]], dfspl[[3]])
# [1] 3 2
length(intersect(dfspl[[1]], dfspl[[3]]))
# [1] 2
Granted, we are doing it twice (once for 1 and 3, once for 3 and 1, which is the same result), so this is a little inefficient; it would be better to compute only the upper or lower half and copy it to the other.
Edited for a more efficient process (calculating each intersection pair only once, and never calculating self-intersections).
eg <- expand.grid(a = seq_along(dfspl), b = seq_along(dfspl))
eg <- eg[ eg$a < eg$b, ]
eg
# a b
# 4 1 2
# 7 1 3
# 8 2 3
lens <- lengths(Map(intersect, dfspl[eg$a], dfspl[eg$b]))
lens
# 1 1 2 ## btw, these are just names, from eg$a
# 1 2 0
out <- matrix(nrow = length(dfspl), ncol = length(dfspl))
out[ cbind(eg$a, eg$b) ] <- lens
out
# [,1] [,2] [,3]
# [1,] NA 1 2
# [2,] NA NA 0
# [3,] NA NA NA
out[ lower.tri(out) ] <- t(out)[ lower.tri(out) ]  # mirror the upper triangle
diag(out) <- lengths(dfspl)
out
# [,1] [,2] [,3]
# [1,] 3 1 2
# [2,] 1 1 0
# [3,] 2 0 2
Same idea as #r2evans, also base R (and a bit less elegant) (edited as required):
# Create df using sample data:
df <- data.frame(metas = c(1,1,1,2,3,3), studies = c(1,7,2,1,2,3))
# Test for equality between the values in the metas vector and the rest of
# the values in the data frame:
v1 <- rowSums(data.frame(sapply(df$metas, `==`, unique(unlist(df)))))
# Construct a symmetric matrix from the vector:
m1 <- diag(v1)
m1[, 1] <- m1[1, ] <- v1
# Coerce the matrix to a data frame, setting the names as desired and dropping non-matches:
df_2 <- setNames(data.frame(m1[rowSums(m1) > 0, colSums(m1) > 0]),
                 paste0("v", 1:sum(colSums(m1) > 0)))

Count occurrences of value in a set of variables in R (per row)

Let's say I have a data frame with 10 numeric variables V1-V10 (columns) and multiple rows (cases).
What I would like R to do is: For each case, give me the number of occurrences of a certain value in a set of variables.
For example the number of occurrences of the numeric value 99 in that single row for V2, V3, V6, which obviously has a minimum of 0 (none of the three have the value 99) and a maximum of 3 (all of the three have the value 99).
I am really looking for an equivalent to the SPSS function COUNT: "COUNT creates a numeric variable that, for each case, counts the occurrences of the same value (or list of values) across a list of variables."
I thought about table() and plyr's count(), but I cannot really figure it out. Vectorized computation is preferred. Thanks a lot!
If you need to count a particular word or letter in each row:
#Let df be a data frame with four variables (V1-V4)
df <- data.frame(V1=c(1,1,2,1,"L"), V2=c(1,"L",2,2,"L"),
                 V3=c(1,2,2,1,"L"), V4=c("L","L",1,2,"L"))
For counting the number of "L" in each row, just use:
#This is how to compute a new variable counting occurrences of "L" in V1-V4.
df$count.L <- apply(df, 1, function(x) length(which(x=="L")))
The result will look like this:
> df
  V1 V2 V3 V4 count.L
1  1  1  1  L       1
2  1  L  2  L       2
3  2  2  2  1       0
4  1  2  1  2       0
5  L  L  L  L       4
I think that there ought to be a simpler way to do this, but the best way that I can think of to get a table of counts is to loop (implicitly using sapply) over the unique values in the dataframe.
#Some example data
df <- data.frame(a=c(1,1,2,2,3,9),b=c(1,2,3,2,3,1))
df
# a b
#1 1 1
#2 1 2
#3 2 3
#4 2 2
#5 3 3
#6 9 1
levels <- unique(do.call(c, df)) # all unique values in df
out <- sapply(levels, function(x) rowSums(df == x)) # count occurrences of x in each row
colnames(out) <- levels
out
# 1 2 3 9
#[1,] 2 0 0 0
#[2,] 1 1 0 0
#[3,] 0 1 1 0
#[4,] 0 2 0 0
#[5,] 0 0 2 0
#[6,] 1 0 0 1
Try
apply(df,MARGIN=1,table)
where df is your data.frame. This returns a list with as many elements as your data.frame has rows. Each element corresponds to a row of the data.frame (in the same order) and is a table whose values are the counts and whose names are the values being counted.
For instance:
df=data.frame(V1=c(10,20,10,20),V2=c(20,30,20,30),V3=c(20,10,20,10))
#create a data.frame containing some data
df #show the data.frame
V1 V2 V3
1 10 20 20
2 20 30 10
3 10 20 20
4 20 30 10
apply(df,MARGIN=1,table) #apply the function table on each row (MARGIN=1)
[[1]]
10 20
1 2
[[2]]
10 20 30
1 1 1
[[3]]
10 20
1 2
[[4]]
10 20 30
1 1 1
#desired result
Here is another straightforward solution that comes closest to what the COUNT command in SPSS does: it creates a new variable that, for each case (i.e., row), counts the occurrences of a given value (or list of values) across a list of variables.
#Let df be a data frame with four variables (V1-V4)
df <- data.frame(V1=c(1,1,2,1,NA),V2=c(1,NA,2,2,NA),
V3=c(1,2,2,1,NA), V4=c(NA, NA, 1,2, NA))
#This is how to compute a new variable counting occurrences of the value "1" in V1-V4.
df$count.1 <- apply(df, 1, function(x) length(which(x==1)))
The updated data frame contains the new variable count.1 exactly as the SPSS COUNT command would do.
> df
V1 V2 V3 V4 count.1
1 1 1 1 NA 3
2 1 NA 2 NA 1
3 2 2 2 1 1
4 1 2 1 2 2
5 NA NA NA NA 0
You can do the same to count how many times the value "2" occurs per row in V1-V4. Note that you need to select the columns (variables) in df to which the function is applied:
df$count.2 <- apply(df[1:4], 1, function(x) length(which(x==2)))
You can also apply a similar logic to count the number of missing values in V1-V4.
df$count.na <- apply(df[1:4], 1, function(x) sum(is.na(x)))
The final result should be exactly what you wanted:
> df
V1 V2 V3 V4 count.1 count.2 count.na
1 1 1 1 NA 3 0 1
2 1 NA 2 NA 1 1 2
3 2 2 2 1 1 3 0
4 1 2 1 2 2 2 0
5 NA NA NA NA 0 0 4
This solution can easily be generalized to a range of values.
Suppose we want to count how many times a value of 1 or 2 occurs in V1-V4 per row:
df$count.1or2 <- apply(df[1:4], 1, function(x) sum(x %in% c(1,2)))
A solution with functions from the dplyr package would be the following:
Using the example data set from the answer by LechAttacks:
df <- data.frame(V1=c(1,1,2,1,NA),V2=c(1,NA,2,2,NA),
V3=c(1,2,2,1,NA), V4=c(NA, NA, 1,2, NA))
Count the appearances of "1" and "2" separately and both combined:
library(dplyr)

df %>%
  rowwise() %>%
  mutate(count_1 = sum(c_across(V1:V4) == 1, na.rm = TRUE),
         count_2 = sum(c_across(V1:V4) == 2, na.rm = TRUE),
         count_12 = sum(c_across(V1:V4) %in% 1:2, na.rm = TRUE)) %>%
  ungroup()
which gives the table:
V1 V2 V3 V4 count_1 count_2 count_12
1 1 1 1 NA 3 0 3
2 1 NA 2 NA 1 1 2
3 2 2 2 1 1 3 4
4 1 2 1 2 2 2 4
5 NA NA NA NA 0 0 0
In my effort to find something similar to COUNT from SPSS, I came up with the following:
library(dplyr) # attach dplyr so the %>% pipe is available

df <- data.frame(a=c(1,1,NA,2,3,9), b=c(1,2,3,2,NA,1)) #Dummy data with NAs
df %>%
  dplyr::mutate(count = rowSums(                 # rowSums adds up the values across each row
    dplyr::select(., dplyr::one_of(c('a','b'))), # one_of picks out the columns you want
    na.rm = TRUE))                               # na.rm drops the NAs before summing
This is what the output looks like:
> df
a b count
1 1 1 2
2 1 2 3
3 NA 3 3
4 2 2 4
5 3 NA 3
6 9 1 10
Hope it helps :-)
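For the literal example in the question (counting how often 99 occurs across V2, V3, and V6 in each row), a single vectorized base R line should do, assuming those columns exist in your data frame:
# Count occurrences of 99 per row, ignoring NAs
df$count.99 <- rowSums(df[c("V2", "V3", "V6")] == 99, na.rm = TRUE)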

Find index of removed unique pairs

I have a data frame of 2 columns and a vector of the same length. I am trying to remove all duplicated pairs from the data frame and, at the same indices, remove the corresponding entries from the vector.
I have a data frame:
> from <- c(1,1,2,4,3)
> to <- c(1,1,2,3,5)
> ft <- data.frame(from,to)
> ft
from to
1 1 1
2 1 1
3 2 2
4 4 3
5 3 5
And vector:
> dist <- c(1,2,3,4,5)
> dist
[1] 1 2 3 4 5
I used the function unique() to remove all duplicated pairs:
> unique(ft)
from to
1 1 1
3 2 2
4 4 3
5 3 5
How can I get the indices at which pairs were removed from "ft" so that I can also remove them from "dist"? That would be index 2 in this case.
As #eddi notes, you can get a logical vector that indicates which rows are duplicates with duplicated(). I combined that with which(), which returns the indices of the elements that are TRUE (i.e., the duplicated rows). You can then create a new data.frame (vector, etc.) by using - to exclude the indicated rows in the subscript of your object.
Edit: In the comments, #DWin points out a better way than using -. If we negate the duplicated() function with !, we get a logical vector that we can use to determine which rows to retain:
> from <- c(1,1,2,4,3)
> to <- c(1,1,2,3,5)
> ft <- data.frame(from,to)
> ft
from to
1 1 1
2 1 1
3 2 2
4 4 3
5 3 5
> dist <- c(1,2,3,4,5)
> dist
[1] 1 2 3 4 5
> remove <- !duplicated(ft)
> remove
[1] TRUE FALSE TRUE TRUE TRUE
> ft.new <- ft[which(remove), ]
> ft.new
from to
1 1 1
3 2 2
4 4 3
5 3 5
> dist.new <- dist[which(remove)]
> dist.new
[1] 1 3 4 5
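Since remove is already a logical vector, the which() wrapper is optional; logical subscripting gives the same result directly:
keep <- !duplicated(ft)  # TRUE for the first occurrence of each pair
ft.new <- ft[keep, ]
dist.new <- dist[keep]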

Convert a list of varying lengths into a dataframe

I am trying to convert a simple list of varying lengths into a data frame as shown below. I would like to populate the missing values with NaN. I tried ldply, rbind, and as.data.frame(), but I failed to get it into the format I want. Please help.
x=c(1,2)
y=c(1,2,3)
z=c(1,2,3,4)
a=list(x,y,z)
a
[[1]]
[1] 1 2
[[2]]
[1] 1 2 3
[[3]]
[1] 1 2 3 4
Output should be:
x y z
1 1 1
2 2 2
NaN 3 3
NaN NaN 4
Using rbind.fill.matrix from "plyr" gets you very close to what you're looking for:
> library(plyr)
> t(rbind.fill.matrix(lapply(a, t)))
[,1] [,2] [,3]
1 1 1 1
2 2 2 2
3 NA 3 3
4 NA NA 4
This is a lot of code, so not as clean as Ananda's solution, but it's all base R:
maxl <- max(sapply(a, length))
out <- do.call(cbind, lapply(a, function(x) x[1:maxl]))
# out <- matrix(unlist(lapply(a, function(x) x[1:maxl])), nrow=maxl) # another way
out <- as.data.frame(out)
# names(out) <- names(a)
Result:
> out
V1 V2 V3
1 1 1 1
2 2 2 2
3 NA 3 3
4 NA NA 4
Note: names of the resulting df will depend on the names of your list (a), which doesn't currently have names.
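Another compact base R route is to pad each vector to the maximum length with length<- before binding; a sketch that reproduces the output above:
maxl <- max(lengths(a))  # length of the longest list element
# `length<-` extends each vector to maxl, filling with NA
out <- as.data.frame(do.call(cbind, lapply(a, `length<-`, maxl)))
out
#   V1 V2 V3
# 1  1  1  1
# 2  2  2  2
# 3 NA  3  3
# 4 NA NA  4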

Check for unique elements

just a simple question.
I have a data frame(only one vector is shown) that looks like:
cln1
A
b
A
A
c
d
A
....
I would like the following output:
cln1
b
c
d
In other words, I would like to remove all items that are replicated. The functions unique() and duplicated() keep one copy of each replicated element in their output; I would like to drop such elements entirely.
You can use setdiff for that:
R> v <- c(1,1,2,2,3,4,5)
R> setdiff(v, v[duplicated(v)])
[1] 3 4 5
You could use count from the plyr package to count the occurrences of each item, and then drop all values that occur more than once.
library(plyr)
l = c(1,2,3,3,4,5,6,6,7)
count_l <- count(l)
count_l
x freq
1 1 1
2 2 1
3 3 2
4 4 1
5 5 1
6 6 2
7 7 1
l[!l %in% with(count_l, x[freq > 1])]
[1] 1 2 4 5 7
Note the !, which means NOT. You can of course put this in a one-liner:
l[!l %in% with(count(l), x[freq > 1])]
Another way using table:
With #juba's data:
as.numeric(names(which(table(v) == 1)))
# [1] 3 4 5
For the OP's data, since the output is character, as.numeric is not required.
names(which(table(v) == 1))
# [1] "b" "c" "d"
