Removing duplicates based on two columns in R [duplicate] - r

This question already has answers here:
Remove duplicate column pairs, sort rows based on 2 columns [duplicate]
(3 answers)
Closed 7 years ago.
Suppose my data is as follows,
X Y
26 14
26 14
26 15
26 15
27 15
27 15
28 16
28 16
I want to remove duplicate rows. I can remove duplicates based on one column with
dat[c(T, diff(dat$X) != 0), ] or dat[c(T, diff(dat$Y) != 0), ]
But I want to drop a row only when both columns repeat the previous row's values. I can't use unique() here, because the same pair can occur again later in the data; I only want to compare each row against the one immediately before it.
My sample output is,
x y
26 14
26 15
27 15
28 16
How can we do this in R?
Thanks
Ijaz

Using data.table (v1.9.5+):
require(data.table) # v1.9.5+
df[!duplicated(rleidv(df, cols = c("X", "Y"))), ]
rleidv() is best understood with examples:
rleidv(c(1,1,1,2,2,3,1,1))
# [1] 1 1 1 2 2 3 4 4
A unique index is generated for each consecutive run of values.
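In base R, a comparable run-length id can be sketched with cumsum() over a change indicator (a minimal sketch, not part of data.table):

```r
# Base-R sketch of a run-length id for a numeric vector:
# a new id starts wherever a value differs from its predecessor.
x <- c(1, 1, 1, 2, 2, 3, 1, 1)
rid <- cumsum(c(TRUE, diff(x) != 0))
rid
# [1] 1 1 1 2 2 3 4 4
```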
And the same can be accomplished on a list() or data.frame() or data.table() on a specific set of columns as well. For example:
df = data.frame(a = c(1,1,2,2,1), b = c(2,3,4,4,2))
rleidv(df) # computes on both columns 'a,b'
# [1] 1 2 3 3 4
rleidv(df, cols = "a") # only looks at 'a'
# [1] 1 1 2 2 3
The rest should be fairly obvious. We just check for duplicated() values, and return the non-duplicated ones.

Using dplyr:
library(dplyr)
z %>% filter(X != lag(X) | Y != lag(Y) | row_number() == 1)
The row_number() == 1 condition is needed because lag() returns NA for the first row, which the filter would otherwise drop.
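As an alternative, newer dplyr (1.1.0 or later) provides consecutive_id(), which makes the lag() comparison unnecessary. A sketch on the sample data, with the data frame named z as in the answer above:

```r
library(dplyr)  # requires dplyr >= 1.1.0 for consecutive_id()

z <- data.frame(X = c(26, 26, 26, 26, 27, 27, 28, 28),
                Y = c(14, 14, 15, 15, 15, 15, 16, 16))

# consecutive_id() gives each consecutive (X, Y) run its own id;
# keeping the first row per run drops the consecutive duplicates.
res <- z %>%
  group_by(run = consecutive_id(X, Y)) %>%
  slice_head(n = 1) %>%
  ungroup() %>%
  select(-run)
res
# X: 26 26 27 28, Y: 14 15 15 16
```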


Efficient way to fill column with numbers that identify observations with same value in column [duplicate]

This question already has answers here:
How to create a consecutive group number
(13 answers)
Add ID column by group [duplicate]
(4 answers)
Closed 4 years ago.
I apologize for the wording of the question and any errors; I'm new to Stack Overflow and to R.
Problem: Find efficient way to fill column with numbers that uniquely identify observations with same value in another column.
Result would look like this:
patient_number id
1 46 1
2 47 2
3 15 3
4 42 4
5 33 5
6 26 6
7 37 7
8 7 8
9 33 5
10 36 9
Sample data frame
set.seed(42)
df <- data.frame(
patient_number = sample(seq(1, 50, 1), 100, replace = TRUE)
)
What I was able to come up with
df$id <- NA  ## create the id column, filled with NA, to make the if statement easier
n_unique <- length(unique(df$patient_number))  ## number of unique observations
for (i in 1:nrow(df)) {
  ## indices of observations with the same patient_number
  index_identical <- which(df$patient_number == df$patient_number[i])
  if (any(is.na(df$id[index_identical]))) {
    ## if any id for this patient_number is still unfilled,
    ## assign the smallest integer between 1 and n_unique not yet used
    df$id[index_identical] <- setdiff(seq(1, n_unique, 1), df$id)[1]
  }
}
It does the job, but with thousands of rows, it takes time.
Thanks for bearing with me.
If you're open to other packages, you can use the group_indices function from the dplyr package:
library(dplyr)
df %>%
mutate(id = group_indices(., patient_number))
patient_number id
1 46 40
2 47 41
3 15 14
4 42 37
5 33 28
6 26 23
7 37 32
8 7 6
9 33 28
10 36 31
11 23 21
12 36 31
13 47 41
...
We can use .GRP from data.table
library(data.table)
setDT(df)[, id := .GRP, patient_number]
Or with base R match and factor options are fast as well
df$id <- with(df, match(patient_number, unique(patient_number)))
df$id <- with(df, as.integer(factor(patient_number,
levels = unique(patient_number))))
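A small worked example may help show how the match() version assigns ids in order of first appearance (hypothetical patient numbers):

```r
# match() returns, for each value, the position of its first
# occurrence in unique(p) -- an id in order of first appearance.
p  <- c(46, 47, 33, 26, 33, 46)
id <- match(p, unique(p))
id
# [1] 1 2 3 4 3 1
```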

R: How to replace NA value in specific condition [duplicate]

This question already has answers here:
Replacing NAs with latest non-NA value
(21 answers)
Closed 4 years ago.
I want to replace the NA values with the non-NA value for the same participant. I tried this, but it returns the original df; I don't know what happened.
df = data.frame(block = c('1',NA,NA,'2',NA,NA,'3',NA,NA),
subject = c('31','31','31','32','32','32','33','33','33'))
df[df$subject == 1 & is.na(df$block)] = df[df$subject == 31 &!is.na(df$block)]
# define a for loop from 1 to n
for (i in 1:length(unique(df$subject))) {
  subjects
  # replace the NA block values with the non-NA block for the same participant
  df[df$subject == i & is.na(df$block)] = df[df$subject == i & !is.na(df$block)]
}
Here is what I want to get: each NA in block filled in with the same subject's non-NA block value.
Using the dplyr and the zoo library, I replaced the NA values in the block column with the previous non-NA row values:
library(dplyr)
library(zoo)
df2 <- df %>%
do(na.locf(.))
The end result looks as follow:
df2
block subject
1 1 31
2 1 31
3 1 31
4 2 32
5 2 32
6 2 32
7 3 33
8 3 33
9 3 33
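A similar fill can be written with tidyr::fill(), grouped by subject so a value never leaks from one participant into the next (a sketch; fill() lives in tidyr, not zoo):

```r
library(dplyr)
library(tidyr)

df <- data.frame(block   = c('1', NA, NA, '2', NA, NA, '3', NA, NA),
                 subject = c('31','31','31','32','32','32','33','33','33'))

# Carry the last non-NA block value downward within each subject.
df2 <- df %>%
  group_by(subject) %>%
  fill(block, .direction = "down") %>%
  ungroup()
```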

Find unique rows in a data frame in R [duplicate]

This question already has answers here:
Finding ALL duplicate rows, including "elements with smaller subscripts"
(9 answers)
Closed 6 years ago.
I'd like to create a new data frame column that helps me quickly identify duplicate rows, based on the value of the first column of each row (index). My data frame (df) has almost 18,000 rows/observations; calling the new column "unique", I tried the following, rather unsuccessfully:
df$unique = ifelse(df[row.names(df):1]==df[row.names(df)-1:1], "YES", "NO")
The rationale behind the code is that comparing each cell with the one in the row above, in the same column, should mark an entry as unique whenever the two values do not match.
My dataframe
index num1 num2
1 12 12
1 12 12
2 14 14
2 14 14
2 14 14
3 18 18
4 19 19
You can use the duplicated function. Be aware that the first occurrence of a repeated row is not itself flagged as a duplicate, hence we need duplicated() twice: once searching from the beginning and once from the end.
# Toy data, where the first two rows are identical, the third row is unique
df <- data.frame(a = c(1, 1, 1), b = c(1, 1, 2))
# Find unique rows
df$unique <- !(duplicated(df) | duplicated(df, fromLast = TRUE))
Output:
> df
a b unique
1 1 1 FALSE
2 1 1 FALSE
3 1 2 TRUE
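Applied to the asker's setup, where duplicates are defined by the first column only, the same two-sided duplicated() trick can be restricted to that column (a sketch using the sample index values):

```r
df <- data.frame(index = c(1, 1, 2, 2, 2, 3, 4),
                 num1  = c(12, 12, 14, 14, 14, 18, 19))

# Flag rows whose index value occurs more than once, searching
# from both ends so first occurrences are caught too.
dup <- duplicated(df$index) | duplicated(df$index, fromLast = TRUE)
df$unique <- ifelse(dup, "NO", "YES")
df$unique
# "NO" "NO" "NO" "NO" "NO" "YES" "YES"
```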

group and label rows in data frame by numeric in R

I need to group and label every x observations (rows) in a dataset in R, and I need to know whether the last group in the dataset has fewer than x observations.
For example:
If I use a dataset with 10 observations and 2 variables and I want to group by every 3 rows.
I want to add a new column so that the dataset looks like this:
speed dist newcol
4 2 1
4 10 1
7 4 1
7 22 2
8 16 2
9 10 2
10 18 3
10 26 3
10 34 3
11 17 4
df$group <- rep(1:(nrow(df)/3), each = 3)
This works if the number of rows is an exact multiple of 3: every run of three rows gets the same serial group number.
A quick and dirty way to find out how incomplete the final group is, is to check the remainder when nrow is divided by the group size: nrow(df) %% 3 (change the divisor to your group size).
assuming your data is df you can do
df$newcol = rep(1:ceiling(nrow(df)/3), each = 3)[1:nrow(df)]
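An equivalent one-liner uses ceiling() on the row index; the sample data above happens to be the first ten rows of the built-in cars data set, so the sketch uses that:

```r
df <- head(cars, 10)  # speed and dist, 10 rows

# Row i goes into group ceiling(i / 3): rows 1-3 -> 1, 4-6 -> 2, ...
df$newcol <- ceiling(seq_len(nrow(df)) / 3)
df$newcol
# [1] 1 1 1 2 2 2 3 3 3 4

# The last group is incomplete whenever this remainder is non-zero:
nrow(df) %% 3
# [1] 1
```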

Specific removing all duplicates with R [duplicate]

This question already has answers here:
Finding ALL duplicate rows, including "elements with smaller subscripts"
(9 answers)
Remove all duplicate rows including the "reference" row [duplicate]
(3 answers)
Closed 7 years ago.
For example I have two columns:
Var1 Var2
1 12
1 65
2 68
2 98
3 49
3 24
4 8
5 67
6 12
And I need to display only values which are unique for column Var1:
Var1 Var2
4 8
5 67
6 12
I tried to do it like this:
mydata=mydata[!unique(mydata$Var1),]
But when I use the same formula on my large data set with about 1 million observations, nothing happens: the sample size stays the same. Could you please explain why?
Thank you!
With data.table (as the question seems to be tagged with it) I would do
indx <- setDT(DT)[, .I[.N == 1], by = Var1]$V1
DT[indx]
# Var1 Var2
# 1: 4 8
# 2: 5 67
# 3: 6 12
Or... as #eddi reminded me, you can simply do
DT[, if(.N == 1) .SD, by = Var1]
Or (per the mentioned duplicates) with v >= 1.9.5 you could also do something like
setDT(DT, key = "Var1")[!(duplicated(DT) | duplicated(DT, fromLast = TRUE))]
You can use this:
df <- data.frame(Var1=c(1,1,2,2,3,3,4,5,6), Var2=c(12,65,68,98,49,24,8,67,12) );
df[ave(1:nrow(df),df$Var1,FUN=length)==1,];
## Var1 Var2
## 7 4 8
## 8 5 67
## 9 6 12
This will work even if the Var1 column is not ordered, because ave() does the necessary work to collect groups of equal elements (even if they are non-consecutive in the grouping vector) and map the result of the function call (length() in this case) back to each element that was a member of the group.
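Another base-R route, shown here as a sketch, counts the Var1 values with table() and keeps the rows whose value occurs exactly once:

```r
df <- data.frame(Var1 = c(1, 1, 2, 2, 3, 3, 4, 5, 6),
                 Var2 = c(12, 65, 68, 98, 49, 24, 8, 67, 12))

counts <- table(df$Var1)                      # occurrences per Var1 value
uniq   <- df[df$Var1 %in% names(counts)[counts == 1], ]
uniq
#   Var1 Var2
# 7    4    8
# 8    5   67
# 9    6   12
```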
Regarding your code, it doesn't work because this is what unique() and its negation returns:
unique(df$Var1);
## [1] 1 2 3 4 5 6
!unique(df$Var1);
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
As you can see, unique() returns the actual unique values from the argument vector. Negation returns true for zero and false for everything else.
Thus, you end up row-indexing using a short logical vector (it will be short if there were any duplicates removed by unique()) consisting of TRUE where there were zeroes, and FALSE otherwise.