Create ID variable: if ≥1 column is duplicated, mark as duplicate - r

I've seen many questions about creating a new ID variable based on conditions over multiple columns. However, it is usually: if var1 AND var2 are duplicated, then mark with a duplicate number.
My question is: how do you create a new ID variable and mark duplicates if
var1 is a duplicate, OR
var2 is a duplicate, OR
var3 is a duplicate?
Example dataset (EDITED):
pat var1 var2 var3
1 1 1 10 1
2 2 16 10 11
3 3 21 27 2
4 4 22 29 2
5 5 31 35 3
6 6 44 47 4
7 7 5 50 5
8 8 6 60 6
9 9 7 70 7
10 10 8 80 7
11 11 9 90 8
12 12 10 11 9
13 13 11 13 91
14 14 11 14 10
15 15 NA 15 15
16 16 NA 15 16
17 17 12 NA 17
18 18 13 NA 18
sample <- data.frame(pat = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18),
                     var1 = c(1,16,21,22,31,44,5,6,7,8,9,10,11,11,NA,NA,12,13),
                     var2 = c(10,10,27,29,35,47,50,60,70,80,90,11,13,14,15,15,NA,NA),
                     var3 = c(1,11,2,2,3,4,5,6,7,7,8,9,91,10,15,16,17,18))
So if one of the three var variables is duplicated, then the new ID variable should show a duplicate ID number.
Desired output (EDITED):
pat var1 var2 var3 ID
1 1 1 10 1 1
2 2 16 10 11 1
3 3 21 27 2 2
4 4 22 29 2 2
5 5 31 35 3 3
6 6 44 47 4 4
7 7 5 50 5 5
8 8 6 60 6 6
9 9 7 70 7 7
10 10 8 80 7 7
11 11 9 90 8 8
12 12 10 11 9 9
13 13 11 13 91 10
14 14 11 14 10 10
15 15 NA 15 15 11
16 16 NA 15 16 11
17 17 12 NA 17 12
18 18 13 NA 18 13
I couldn't find a question based on similar conditions, therefore I'm asking it.
Many thanks in advance.
EDIT Ben's answer works perfectly if there are no NA values present. Unfortunately, I did not mention that I also had NA values for var1, var2, or var3. An NA value means the ID number for that variable is missing. So I've adjusted the question a bit and added some NA values.
The added question is:
Is it possible for a script to judge: if var1 = c(NA, NA), var2 = c(1, 1) and var3 = c(1, 2), report a duplicate; but if var1 = c(NA, NA), var2 = c(1, 2) and var3 = c(1, 2), report a unique number?

Maybe you could try the following. Here we use tail and head to refer to rows 2 through 14 and rows 1 through 13 respectively, effectively comparing each row with the prior row.
rowSums counts, for each row, how many columns match the previous row. If no column matches (the row sum is 0), the comparison is TRUE (i.e. 1) and the ID should increase by one; these increments are accumulated with cumsum.
Prepending 1 with c makes the first ID 1, and the cumsum is offset by 1 to account for that initial ID.
sample$ID <- c(1, cumsum(rowSums(tail(sample[-1], -1) == head(sample[-1], -1)) == 0) + 1)
sample
Output
pat var1 var2 var3 ID
1 1 1 10 1 1
2 2 16 10 11 1
3 3 21 27 2 2
4 4 22 29 2 2
5 5 31 35 3 3
6 6 44 47 4 4
7 7 5 50 5 5
8 8 6 60 6 6
9 9 7 70 7 7
10 10 8 80 7 7
11 11 9 90 8 8
12 12 10 11 9 9
13 13 11 13 91 10
14 14 11 14 10 10
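As a sanity check, you can inspect the intermediate match counts directly. This is a small sketch using only the NA-free rows 1-14 (which match the original example data); s14 and matches are just illustrative names:
s14 <- sample[1:14, ]
# number of columns in which each row matches the previous row
matches <- rowSums(tail(s14[-1], -1) == head(s14[-1], -1))
matches
# reproduces the ID column above
c(1, cumsum(matches == 0) + 1)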
Edit: Based on the comment below, there are occasions where the value is NA, and these should be ignored. In the example above, a repeated NA (such as var2 in rows 17-18) does not count as a duplicate.
Here is another approach. You can use sapply to go through the row numbers of your data.frame.
You can use mapply to subtract each var in a given row from the corresponding var in the next row, and check whether any difference is zero. Note that na.rm = TRUE makes any() ignore the NA differences.
sample$ID <- c(
  1,
  cumsum(
    sapply(
      seq_len(nrow(sample) - 1),
      \(x) {
        !any(mapply(`-`, sample[x, -1, drop = TRUE],
                    sample[x + 1, -1, drop = TRUE]) == 0,
             na.rm = TRUE)
      }
    )
  ) + 1
)
Output
pat var1 var2 var3 ID
1 1 1 10 1 1
2 2 16 10 11 1
3 3 21 27 2 2
4 4 22 29 2 2
5 5 31 35 3 3
6 6 44 47 4 4
7 7 5 50 5 5
8 8 6 60 6 6
9 9 7 70 7 7
10 10 8 80 7 7
11 11 9 90 8 8
12 12 10 11 9 9
13 13 11 13 91 10
14 14 11 14 10 10
15 15 NA 15 15 11
16 16 NA 15 16 11
17 17 12 NA 17 12
18 18 13 NA 18 13
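To see that this handles the NA edge case from the question, here is a quick check on a hypothetical two-row data frame (toy is an illustrative name):
toy <- data.frame(pat = 1:2, var1 = c(NA, NA), var2 = c(1, 1), var3 = c(1, 2))
# the per-row test used inside sapply above
!any(mapply(`-`, toy[1, -1, drop = TRUE], toy[2, -1, drop = TRUE]) == 0, na.rm = TRUE)
# FALSE -> no increment, so both rows share an ID (a duplicate);
# with var2 = c(1, 2) instead, this returns TRUE and row 2 gets a new ID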

Related

filter() rows from dataframe with condition on previous and next row, keeping NA values

I have a dataframe like this:
AA <- c(1, 2, 4, 5, 6, 7, 10, 11, 12, 13, 14, 15)
BB <- c(32, 21, 21, NA, 27, 31, 31, 12, 28, NA, 48, 7)
df <- data.frame(AA, BB)
I want to remove rows where the BB value is equal to the previous or the next row's value, keeping only the first and last occurrence of each run of values in the BB column. I also want to keep the NA rows. I arrived at this code, which is not so far from what I want:
lighten_df <- df %>% filter(BB != lag(BB) | BB != lead(BB) | is.na(BB))
which gives me:
> lighten_df
AA BB
1 1 32
2 2 21
3 5 NA
4 6 27
5 7 31
6 10 31
7 11 12
8 12 28
9 13 NA
10 14 48
11 15 7
My problem is that I would like to keep both the first and the last occurrence of the value 21 in column BB. This is the result I expect:
AA BB
1 1 32
2 2 21
3 4 21
4 5 NA
5 6 27
6 7 31
7 10 31
8 11 12
9 12 28
10 13 NA
11 14 48
12 15 7
Any idea?
I would suggest a different approach: define a grouping variable and keep the first and last rows within each group:
df %>%
group_by(grp = data.table::rleid(BB)) %>%
slice(unique(c(1, n())))
# # A tibble: 12 × 3
# # Groups: grp [10]
# AA BB grp
# <dbl> <dbl> <int>
# 1 1 32 1
# 2 2 21 2
# 3 4 21 2
# 4 5 NA 3
# 5 6 27 4
# 6 7 31 5
# 7 10 31 5
# 8 11 12 6
# 9 12 28 7
# 10 13 NA 8
# 11 14 48 9
# 12 15 7 10
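data.table::rleid() gives each run of consecutive equal values its own id, and consecutive NAs also count as a run (unlike == comparisons, where NA is never equal to NA). For reference, this is what it returns for BB here:
data.table::rleid(df$BB)
# [1] 1 2 2 3 4 5 5 6 7 8 9 10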

closest value from all previous rows r

I have a dataframe to which I would like to add a column identifying, for each row, the closest value among all previous rows only, ignoring the row itself.
I found a closest-value function but am unsure how to limit it to only previous rows. In the following example I would like to find the closest Revenue value considering only previous rows.
set.seed(1)
df <- data.frame(id = 1:20, Revenue = sample(20))
closest <- function(xv, sv) {
  xv[which(abs(xv - sv) == min(abs(xv - sv)))]
}
You can try the code below using dist + apply
transform(
df,
close_prev = Revenue[apply(`diag<-`(m <- as.matrix(dist(Revenue)), Inf) / upper.tri(m), 2, which.min)]
)
which gives
id Revenue close_prev
1 1 4 4
2 2 7 4
3 3 1 4
4 4 2 1
5 5 13 7
6 6 19 13
7 7 11 13
8 8 17 19
9 9 14 13
10 10 3 4
11 11 18 19
12 12 5 4
13 13 9 7
14 14 16 17
15 15 6 7
16 16 15 14
17 17 12 13
18 18 10 11
19 19 20 19
20 20 8 7
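The one-liner is dense, so here is the same computation unrolled step by step (a sketch equivalent to the transform call above; m and idx are just illustrative names):
m <- as.matrix(dist(df$Revenue))  # |Revenue[i] - Revenue[j]| for all pairs
diag(m) <- Inf                    # a value is never its own neighbour
m <- m / upper.tri(m)             # dividing by FALSE (0) sends everything at or
                                  # below the diagonal to Inf, so in each column
                                  # only previous rows stay finite
idx <- apply(m, 2, which.min)     # per column: the nearest previous row
df$close_prev <- df$Revenue[idx]
Note that row 1 has no previous value: every entry in column 1 is Inf, so which.min falls back to index 1 and close_prev for row 1 is its own Revenue value, as seen in the output.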
To get only one closest value for each number, you can change the function to use which.min (which returns the first minimum) and use the following.
library(dplyr)
library(purrr)
closest <- function(xv, sv) xv[which.min(abs(xv - sv))]

df %>%
  mutate(close_prev = map_dbl(row_number(),
                              ~ closest(Revenue[seq_len(max(.x - 1, 1))], Revenue[.x])))
# id Revenue close_prev
#1 1 4 4
#2 2 7 4
#3 3 1 4
#4 4 2 1
#5 5 13 7
#6 6 19 13
#7 7 11 13
#8 8 17 19
#9 9 14 13
#10 10 3 4
#11 11 18 19
#12 12 5 4
#13 13 9 7
#14 14 16 17
#15 15 6 7
#16 16 15 14
#17 17 12 13
#18 18 10 11
#19 19 20 19
#20 20 8 7
All the previous values (rows 1 through .x - 1) are passed to the closest function each time. max(.x - 1, 1) handles the first row, since there is no value before it.
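This same fallback means close_prev for the first row is its own Revenue value here too (Revenue[seq_len(1)] is just Revenue[1]). If you would rather mark it as missing, you could set it afterwards (assuming the result was assigned back to df):
df$close_prev[1] <- NA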

Subset data frame based on column values

I have a data frame consisting of the fluorescence read out of multiple cells tracked over time, for example:
Number <- c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4)
Fluorescence <- c(9, 10, 20, 30, 8, 11, 21, 31, 6, 12, 22, 32, 7, 13, 23, 33)
df <- data.frame(Number, Fluorescence)
Which gets:
Number Fluorescence
1 1 9
2 2 10
3 3 20
4 4 30
5 1 8
6 2 11
7 3 21
8 4 31
9 1 6
10 2 12
11 3 22
12 4 32
13 1 7
14 2 13
15 3 23
16 4 33
Number pertains to the cell number. What I want is to collate the fluorescence readout by cell number. The data.frame here cycles through cells 1-4 repeatedly, whereas I really want the rows grouped by cell, like this:
Number Fluorescence
1 1 9
2 1 8
3 1 6
4 1 7
5 2 10
6 2 11
7 2 12
8 2 13
9 3 20
10 3 21
11 3 22
12 3 23
13 4 30
14 4 31
15 4 32
16 4 33
Or, even more ideally, columns based on Number, each holding that cell's fluorescence values:
1 2 3 4
1 9 10 20 30
2 8 11 21 31
3 6 12 22 32
4 7 13 23 33
I've used the which function to extract them one at a time:
Cell1 <- df[which(df[, 1] == 1), 2]
But this would require me to write a line for each cell (of which there are hundreds).
Thank you for any help with this! Apologies that I'm still a bit of an R noob.
How about this:
library(tidyr)
library(data.table)
number <- c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4)
fl <- c(9, 10, 20, 30, 8, 11, 21, 31, 6, 12, 22, 32, 7, 13, 23, 33)
df <- data.table(number, fl)
df[, index := 1:.N, keyby = number]
df
number fl index
1: 1 9 1
2: 1 8 2
3: 1 6 3
4: 1 7 4
5: 2 10 1
6: 2 11 2
7: 2 12 3
8: 2 13 4
9: 3 20 1
10: 3 21 2
11: 3 22 3
12: 3 23 4
13: 4 30 1
14: 4 31 2
15: 4 32 3
16: 4 33 4
The index column is added to give spread from tidyr the unique identifier it needs. See this post for more information.
spread(df, number, fl)
index 1 2 3 4
1: 1 9 10 20 30
2: 2 8 11 21 31
3: 3 6 12 22 32
4: 4 7 13 23 33
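As an aside, spread() has since been superseded in tidyr by pivot_wider(); assuming the same df (including the index column) as above, the equivalent call would be:
pivot_wider(df, names_from = number, values_from = fl)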

R - Index position with condition

I've a data frame like this
w <- c(0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0)
I would like an index position that starts counting at each value of 1 (NA before the first 1).
Output: NA,NA,NA,NA,NA,1,2,3,4,5,6,7,1,2,3,4,5,1,2,3,4,5,6,7,8,9
Ideally applicable to a data frame.
Thanks
Edit: w is a data frame; roughly, this is what I have:
m <- as.data.frame(w)
m[m != 1] <- row(m)[m != 1]
m
w
1 1
2 2
3 3
4 4
5 5
6 1
7 7
8 8
9 9
10 10
11 11
12 12
13 1
14 14
15 15
16 16
17 17
18 1
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
but the count should return to 1 whenever the value 1 is matched.
> m
w wanted
1 1 NA
2 2 NA
3 3 NA
4 4 NA
5 5 NA
6 1 1
7 7 2
8 8 3
9 9 4
10 10 5
11 11 6
12 12 7
13 1 1
14 14 2
15 15 3
16 16 4
17 17 5
18 1 1
19 19 2
20 20 3
21 21 4
22 22 5
23 23 6
24 24 7
25 25 8
26 26 9
Thanks
This assumes that the data is ordered in the way shown in example.
m$wanted <- with(m, ave(w, cumsum(c(TRUE, diff(w) < 0)), FUN = seq_along))
m$wanted
#[1] 1 2 3 4 5 1 2 3 4 5 6 7 1 2 3 4 5 1 2 3 4 5 6 7 8 9
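Here cumsum(c(TRUE, diff(w) < 0)) starts a new group wherever w decreases, and ave(..., FUN = seq_along) numbers the rows within each group. If the positions before the first 1 should be NA, as in the desired output, one way (a small sketch) is to blank them afterwards:
m$wanted[seq_len(which(m$w == 1)[1] - 1)] <- NA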
For the given data, including repeated 1's and non-sequential input, the following works:
m[9, 1] <- 100
m[3, 1] <- 55
m[14, 1] <- 60
m[25, 1] <- 1
m[19, 1] <- 1
# cumsum(m$w == 1) says which 1 each row follows; which(m$w == 1)[...] looks up
# that 1's row position, and subtracting it from the row number restarts the count
m$result <- 1:nrow(m) - which(m$w == 1)[cumsum(m$w == 1)] + 1
But if the data does not start with a 1:
m[1, 1] <- 2
Then this works:
firstone <- which(m$w == 1)[1]
subindex <- m[firstone:nrow(m), 'w'] == 1
m$result <- c(rep(NA, firstone - 1),
              1:length(subindex) - which(subindex)[cumsum(subindex)] + 1)

divide dataframe into subgroups based on several columns successively in R

I have to sort a data pool with the following structure into subgroups based on the values of 3 columns in R, but I cannot figure it out.
What I want to do is:
First, sort the data pool by column V1 in descending order and divide it into three subgroups according to the value of V1.
Then sort each of the 3 subgroups by the value of V2, again into 3 subgroups each, so that we have 9 subgroups.
Similarly, subdivide each of the 9 groups into 3 groups again, resulting in 27 subgroups altogether.
The following data is only a simple example; the real data has 1545 firms.
Firm value V1 V2 V3
1 7 7 11 8
2 9 9 11 7
3 8 14 8 10
4 9 9 7 14
5 8 11 15 14
6 9 10 9 7
7 8 8 6 14
8 4 8 11 14
9 8 10 13 10
10 2 11 6 13
11 3 5 12 14
12 5 12 15 12
13 1 9 13 7
14 4 5 14 7
15 5 10 5 9
16 5 8 13 14
17 2 10 10 7
18 5 12 12 9
19 7 6 11 7
20 6 9 14 14
21 6 14 9 14
22 8 6 6 7
23 9 11 9 5
24 7 7 6 9
25 10 5 15 11
26 4 6 10 9
27 4 13 14 8
And the result should be:
Firm value V1 V2 V3
5 8 11 15 14
12 5 12 15 12
27 4 13 14 8
21 6 14 9 14
18 5 12 12 9
23 9 11 9 5
10 2 11 6 13
3 8 14 8 10
6 9 10 9 7
20 6 9 14 14
9 8 10 13 10
13 1 9 13 7
8 4 8 11 14
2 9 9 11 7
17 2 10 10 7
4 9 9 7 14
7 8 8 6 14
15 5 10 5 9
16 5 8 13 14
25 10 5 15 11
14 4 5 14 7
11 3 5 12 14
1 7 7 11 8
19 7 6 11 7
26 4 6 10 9
24 7 7 6 9
22 8 6 6 7
I have tried for a long time and also searched Google, without success. :(
As @Codoremifa said, data.table can be used here:
require(data.table)
DT <- data.table(dat)
DT[order(V1), G1 := rep(1:3, each = 9)]
DT[order(V2), G2 := rep(1:3, each = 3), by = G1]
DT[order(V3), G3 := 1:3, by = 'G1,G2']
Now your groups are labeled by the additional columns G1, G2 and G3. To sort, so that it's easier to see the groups, use
setkey(DT, G1, G2, G3)
A couple of the OP's columns are just noise unrelated to the question; to verify that this works by eye, try DT[, list(V1, V2, V3, G1, G2, G3)]
EDIT: The OP did not specify a means of dealing with ties. I guess it makes sense to use the values in the later columns to break ties, so...
DT <- data.table(dat)
DT[order(rank(V1) + rank(V2)/100 + rank(V3)/100^2), G1 := rep(1:3, each = 9)]
DT[order(rank(V2) + rank(V3)/100), G2 := rep(1:3, each = 3), by = G1]
DT[order(V3), G3 := 1:3, by = 'G1,G2']
setkey(DT, G1, G2, G3)
DT[27:1] (the result backwards) is
Firm value V1 V2 V3 G1 G2 G3
1: 5 8 11 15 14 3 3 3
2: 12 5 12 15 12 3 3 2
3: 27 4 13 14 8 3 3 1
4: 21 6 14 9 14 3 2 3
5: 9 8 10 13 10 3 2 2
6: 18 5 12 12 9 3 2 1
7: 10 2 11 6 13 3 1 3
8: 3 8 14 8 10 3 1 2
9: 23 9 11 9 5 3 1 1
10: 20 6 9 14 14 2 3 3
11: 16 5 8 13 14 2 3 2
12: 13 1 9 13 7 2 3 1
13: 8 4 8 11 14 2 2 3
14: 17 2 10 10 7 2 2 2
15: 2 9 9 11 7 2 2 1
16: 4 9 9 7 14 2 1 3
17: 15 5 10 5 9 2 1 2
18: 6 9 10 9 7 2 1 1
19: 11 3 5 12 14 1 3 3
20: 25 10 5 15 11 1 3 2
21: 14 4 5 14 7 1 3 1
22: 26 4 6 10 9 1 2 3
23: 1 7 7 11 8 1 2 2
24: 19 7 6 11 7 1 2 1
25: 7 8 8 6 14 1 1 3
26: 24 7 7 6 9 1 1 2
27: 22 8 6 6 7 1 1 1
Firm value V1 V2 V3 G1 G2 G3
Here is an answer using transform and then ddply from plyr. I don't address ties, which means that in case of a tie the value from the lowest row number is used first. This is what the OP shows in the example output.
First, order the dataset in descending order of V1 and create three groups of 9 by creating a new variable, fv1.
dat1 = transform(dat1[order(-dat1$V1),], fv1 = factor(rep(1:3, each = 9)))
Then order the dataset in descending order of V2 and create three groups of 3 within each level of fv1.
require(plyr)
dat1 = ddply(dat1[order(-dat1$V2),], .(fv1), transform, fv2 = factor(rep(1:3, each = 3)))
Finally, order the dataset by the two factors and V3. I use arrange from plyr for typing efficiency compared to order.
(finaldat = arrange(dat1, fv1, fv2, -V3))
This isn't a particularly generalizable answer, as the group sizes are known in advance for the factors. If the V3 group size were larger than one, a process similar to that used for V2 would be needed.
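For completeness, a sketch of the same idea in current dplyr (plyr is retired): ntile() does the equal-size splitting, with ties broken by row order, so results may differ from the answers above when values tie. Note that with desc(), group 1 holds the largest values, the reverse of the G1/G2/G3 labels in the data.table answer:
library(dplyr)
dat %>%
  mutate(G1 = ntile(desc(V1), 3)) %>%    # 3 groups of 9 by descending V1
  group_by(G1) %>%
  mutate(G2 = ntile(desc(V2), 3)) %>%    # 3 groups of 3 within each G1
  group_by(G1, G2) %>%
  mutate(G3 = ntile(desc(V3), 3)) %>%    # one rank per row within each G1 x G2
  ungroup() %>%
  arrange(G1, G2, G3)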
