How to flag duplicate values in R

I'm trying to flag duplicate IDs in another column. I don't necessarily want to remove them yet, just create an indicator (0/1) of whether each ID is unique or a duplicate. In SQL, it would be something like this:
UPDATE TABLE
SET duplicate_flag = 1
WHERE ID IN (SELECT ID FROM TABLE GROUP BY ID HAVING count(ID) > 1);
Is there a simple way to do this in R?
Any help would be greatly appreciated.

As an example of duplicated(), let's start with some values (numbers here, but strings would behave the same way):
x <- c(9, 1:5, 3:7, 0:8)
x
# 9 1 2 3 4 5 3 4 5 6 7 0 1 2 3 4 5 6 7 8
If you want to flag the second and later copies
as.numeric(duplicated(x))
# 0 0 0 0 0 0 1 1 1 0 0 0 1 1 1 1 1 1 1 0
If you want to flag all values that occur two or more times
as.numeric(x %in% x[duplicated(x)])
# 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0
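The same idea carries over to a data frame column. The following is a minimal sketch, assuming a hypothetical data frame df with an ID column (the data and the column names dup_later and dup_any are made up for illustration); combining duplicated() with its fromLast = TRUE counterpart is another way to flag every copy of a repeated ID.

```r
# Hypothetical example data: the IDs and column names are illustrative only
df <- data.frame(ID = c("a", "b", "a", "c", "b", "a"))

# 1 for the second and later copies of an ID, 0 otherwise
df$dup_later <- as.integer(duplicated(df$ID))

# 1 for every row whose ID occurs more than once, 0 otherwise
df$dup_any <- as.integer(duplicated(df$ID) | duplicated(df$ID, fromLast = TRUE))

df
```

Nothing is removed here; the flags can be used later to filter, for example df[df$dup_any == 0, ].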

Related

If else Condition in R based on different columns and rows

I have a dataset with an ID column with multiple visits for every ID. I am trying to create a new variable Status, which will check the Visit column and Value column. The conditions are as follows
For visit in 1,2 & 3, if the values are 1,1,1 then 1
For visit in 1,2 & 3, if the values are 0,1,1 then 0
For visit in 1,2 & 3, if the values are 0,0,0 then 0
How do I specify this condition in R?
Below is a sample dataset
ID Visit Value
1 1 1
1 2 1
1 3 1
2 1 1
2 2 0
2 3 0
3 1 0
3 2 0
3 3 0
4 1 0
4 2 1
4 3 1
Result dataset
ID Visit Value Status
1 1 1 1
1 2 1 1
1 3 1 1
2 1 1 0
2 2 0 0
2 3 0 0
3 1 0 0
3 2 0 0
3 3 0 0
4 1 0 0
4 2 1 0
4 3 1 0
I would try something like this (assuming your initial table is called df):
status <- c()
for (i in 1:4) { # 1:4 are the IDs shown in the sample data
  # each ID has three visits, so a sum of 3 means all values are 1
  if (sum(df[df$ID == i, "Value"]) == 3) status <- c(status, rep(1, 3))
  if (sum(df[df$ID == i, "Value"]) != 3) status <- c(status, rep(0, 3))
}
df <- cbind(df, Status = status)
I hope this helps.
I believe that case_when from the dplyr package is what you need. More details on that function are here: https://dplyr.tidyverse.org/reference/case_when.html
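As an alternative that avoids hard-coding the IDs and the number of visits, here is a minimal base R sketch using ave(). It assumes the rule implied by the sample output, namely that Status is 1 only when every Value within an ID is 1 and 0 otherwise; the data frame is rebuilt from the sample table in the question.

```r
# Sample data reconstructed from the question
df <- data.frame(
  ID    = rep(1:4, each = 3),
  Visit = rep(1:3, times = 4),
  Value = c(1, 1, 1,  1, 0, 0,  0, 0, 0,  0, 1, 1)
)

# Assumed rule: Status is 1 only if all Values within an ID equal 1
df$Status <- ave(df$Value, df$ID, FUN = function(v) as.integer(all(v == 1)))

df
```

Because ave() applies the function per group and recycles the single result across the group's rows, this also works when IDs have different numbers of visits.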

Conditionally delete individuals from longitudinal data

I have a longitudinal data set where I want to drop individuals (id) if they do not fulfill the criterion, indicated by criteria == 1, at any time point. To put it in context, we could say that criteria denotes whether the individual was living in the region of interest at that time.
Using some toy-data that have a similar structure as mine:
id <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
time <- c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3)
event <- c(0,1,0,1,0,0,0,0,0,0,1,0,1,0,1)
criteria <- c(1,0,0,0,0,0, 0, 0, 0, 1, 1, 1,0,0,1)
df <- data.frame(cbind(id,time,event, criteria))
> df
id time event criteria
1 1 1 0 1
2 1 2 1 0
3 1 3 0 0
4 2 1 1 0
5 2 2 0 0
6 2 3 0 0
7 3 1 0 0
8 3 2 0 0
9 3 3 0 0
10 4 1 0 1
11 4 2 1 1
12 4 3 0 1
13 5 1 1 0
14 5 2 0 0
15 5 3 1 1
Removing every id that has criteria == 0 at all time points (time) would lead to an end result looking like this:
id time event criteria
1 1 1 0 1
2 1 2 1 0
3 1 3 0 0
4 4 1 0 1
5 4 2 1 1
6 4 3 0 1
7 5 1 1 0
8 5 2 0 0
9 5 3 1 1
I've been trying to achieve this by using dplyr::group_by(id) and then filtering on the criterion, but that does not give the result I want. I'd prefer a tidyverse solution! :D
Thanks!
library(dplyr)
df %>%
  group_by(id) %>%
  # keep only the ids where criteria == 1 at least once
  filter(any(criteria == 1)) %>%
  ungroup()
If you'd be willing to look into data.table's, which I recommend, it would be as simple as this:
library(data.table)
setDT(df) # make it a data.table
df[ , .SD[ !all(criteria==0) ], by=id ]
See this page for a general introduction and an explanation of the .SD idiom:
https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html
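For comparison, the same group-wise filter can be sketched in base R without any packages. This is only an illustration using the toy data from the question; the names keep and res are introduced here and are not part of the original post.

```r
# Toy data from the question
id <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
time <- c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3)
event <- c(0,1,0,1,0,0,0,0,0,0,1,0,1,0,1)
criteria <- c(1,0,0,0,0,0,0,0,0,1,1,1,0,0,1)
df <- data.frame(id, time, event, criteria)

# Per-id maximum of criteria; it equals 1 when the id met the criterion at least once
keep <- ave(df$criteria, df$id, FUN = max) == 1

# Keep all rows of the qualifying ids
res <- df[keep, ]
res
```

On the toy data this keeps ids 1, 4 and 5, which matches the desired result shown in the question (row names aside).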

How to apply a function within a group that depends on which row within that group holds a value?

I have a dataset that looks like the following, where each ID has 3 levels, and where one of those levels has a value (and all other levels within that ID are 0):
ID level value
1 1 0
1 2 0
1 3 1
2 1 0
2 2 1
2 3 0
I need to return a similar dataframe, with an additional column which specifies which row within the ID has the value 1. In this case:
ID level value which
1 1 0 3
1 2 0 0
1 3 1 0
2 1 0 2
2 2 1 0
2 3 0 0
I feel like I should be able to create this somehow by group_by(ID) and then a mutate based on a case_when that refers to the rows relative to the group (i.e. if it is the 1st, 2nd, or 3rd row), but I can't crack how that should work.
Any suggestions are much appreciated!
You can use which, or better, which.max, which is guaranteed to return only one value.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(which = which.max(value) * +(row_number() == 1))
# ID level value which
# <int> <int> <int> <int>
#1 1 1 0 3
#2 1 2 0 0
#3 1 3 1 0
#4 2 1 0 2
#5 2 2 1 0
#6 2 3 0 0
+(row_number() == 1) ensures that the value of which is assigned only to the first row in each group and that all other rows get 0.
We can use base R
df1$Which <- with(df1, tapply(as.logical(value), ID,
FUN = which)[ID] * !duplicated(ID))
Output:
df1
# ID level value Which
#1 1 1 0 3
#2 1 2 0 0
#3 1 3 1 0
#4 2 1 0 2
#5 2 2 1 0
#6 2 3 0 0
Or another option with ave
df1$Which <- with(df1, ave(as.logical(value), ID, FUN = which) * !duplicated(ID))
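For a self-contained version to experiment with, here is a small sketch that rebuilds the sample data and combines ave() with which.max(). It assumes, as in the question, that each ID has exactly one row with value 1; the variable pos is introduced only for illustration. Because ave() groups by the ID values themselves, it does not rely on the IDs being 1, 2, 3, ... in the way that positional indexing with [ID] does.

```r
# Sample data reconstructed from the question
df1 <- data.frame(
  ID    = rep(1:2, each = 3),
  level = rep(1:3, times = 2),
  value = c(0, 0, 1,  0, 1, 0)
)

# Position (within each ID) of the row holding the 1, repeated on every row of the group
pos <- ave(df1$value, df1$ID, FUN = which.max)

# Keep that position on the first row of each ID only; all other rows become 0
df1$Which <- pos * !duplicated(df1$ID)

df1
```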

How to write new column conditional on grouped rows in R?

I have a data frame where each Item has three categories (a, b,c) and a numeric Answer for each category is recorded (either 0 or 1). I would like to create a new column contingent on the rows in the Answer column. This is how my data frame looks like:
Item <- rep(c(1:3), each=3)
Option <- rep(c('a','b','c'), times=3)
Answer <- c(1,1,0,1,0,1,1,1,1)
df <- data.frame(Item, Option, Answer)
Item Option Answer
1 1 a 1
2 1 b 1
3 1 c 0
4 2 a 1
5 2 b 0
6 2 c 1
7 3 a 1
8 3 b 1
9 3 c 1
What is needed: whenever the Answer is 1 for all three categories in the Option column of an Item, the New column should receive a 1 for that Item's rows. In any other case, the column should have a 0. The desired output should look like this:
Item Option Answer New
1 1 a 1 0
2 1 b 1 0
3 1 c 0 0
4 2 a 1 0
5 2 b 0 0
6 2 c 1 0
7 3 a 1 1
8 3 b 1 1
9 3 c 1 1
I tried to achieve this without using a loop, but I got stuck because I don't know how to make a new column contingent on a group of rows, not just a single one. I have tried this solution but it doesn't work if the rows are not grouped in pairs.
Do you have any suggestions? Thanks a bunch!
This should work:
library(dplyr)
df %>%
  group_by(Item) %>%
  mutate(New = as.numeric(all(as.logical(Answer))))
Using data.table:
library(data.table)
DT <- data.table(Item, Option, Answer)
DT[, Index := as.numeric(all(as.logical(Answer))), by = Item]
DT
Item Option Answer Index
1: 1 a 1 0
2: 1 b 1 0
3: 1 c 0 0
4: 2 a 1 0
5: 2 b 0 0
6: 2 c 1 0
7: 3 a 1 1
8: 3 b 1 1
9: 3 c 1 1
Or using only base R
df$Index <- with(df, +(ave(!!Answer, Item, FUN = all)))
df$Index
#[1] 0 0 0 0 0 0 1 1 1

calculating dataframe row combinations and matches with a separate column

I am trying to match all combinations of a dataframe (each combination reduces to a 1 or a 0 based on the sum) to another column and count the matches. I hacked this together but I feel like there is a better solution. Can someone suggest a better way to do this?
library(HapEstXXR)
test<-data.frame(a=c(0,1,0,1),b=c(1,1,1,1),c=c(0,1,0,1))
actual<-c(0,1,1,1)
ps<-powerset(1:dim(test)[2])
lapply(ps,function(x){
tt<-rowSums(test[,c(x)]) #Note: this fails when there is only one column
tt[tt>1]<-1 #if the sum is greater than 1 reduce it to 1
cbind(sum(tt==actual,na.rm=T),colnames(test)[x])
})
> test
a b c
1 0 1 0
2 1 1 1
3 0 1 0
4 1 1 1
Goal: compare all combinations of columns (order doesn't matter) to the actual column and see which matches most:
b c a ab ac bc abc actual
1 0 0 1 0 1 1 0
1 1 1 1 1 1 1 1
1 0 0 1 0 1 1 1
1 1 1 1 1 1 1 1
matches:
a: 3
b: 3
c: 3
ab: 3
....
Your code seems fine to me; I just simplified it a little:
sapply(ps, function(x) {
  # drop = FALSE keeps a data frame even when x selects a single column
  tt <- rowSums(test[, x, drop = FALSE]) > 0
  colname <- paste(names(test)[x], collapse = '')
  setNames(sum(tt == actual, na.rm = TRUE), colname) # named vector of length one
})
# a b ab c ac bc abc
# 3 3 3 3 3 3 3
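If you would rather not depend on HapEstXXR for the power set, the column subsets can also be generated with base R's combn(). This is a sketch under that assumption, not a drop-in replacement for powerset(): the names subsets and matches are introduced here, and the subsets come out ordered by size rather than in powerset()'s order.

```r
test <- data.frame(a = c(0, 1, 0, 1), b = c(1, 1, 1, 1), c = c(0, 1, 0, 1))
actual <- c(0, 1, 1, 1)

# All non-empty subsets of column indices, ordered by subset size
subsets <- unlist(
  lapply(seq_along(test), function(k) combn(seq_along(test), k, simplify = FALSE)),
  recursive = FALSE
)

# For each subset, reduce the rows to 0/1 and count agreements with actual
matches <- sapply(subsets, function(idx) {
  reduced <- rowSums(test[, idx, drop = FALSE]) > 0
  sum(reduced == actual)
})
names(matches) <- sapply(subsets, function(idx) paste(names(test)[idx], collapse = ""))

matches
```

which.max(matches) would then give the first best-matching combination; with this sample data every combination matches three of the four rows.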

Resources