Removing duplicate rows in dataframes with multiple columns

In a dataframe like this
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
df <-data.frame(a,b)
it is possible to remove duplicate rows (comparing all columns) using:
df[!duplicated(df), ]
If I have a third column c in df and again want to remove duplicates based on the values of column b, is it right to use this?
df[!duplicated(df$b), ]
The dataframe:
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
c <- c("i","i","ii","ii","iii","iii","iv","iv")
df <-data.frame(a,b,c)
removing duplicates based on column b:
df[!duplicated(df$b), ]
the result is this:
a b c
A 1 i
A 2 ii
B 4 ii
but I would expect this:
a b c
A 1 i
A 2 ii
B 4 ii
B 1 iii
C 2 iv

Input:
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
c <- c("i","i","ii","ii","iii","iii","iv","iv")
df <-data.frame(a,b,c)
Described as expected output in post:
a b c
A 1 i
A 2 ii
B 4 ii
B 1 iii
C 2 iv
Using distinct on all columns seems to do what you want:
> library(dplyr)
> distinct(df)
a b c
1 A 1 i
2 A 2 ii
3 B 4 ii
4 B 1 iii
5 C 2 iv
Other variation: only allow unique b's:
> distinct(df,b, .keep_all = TRUE)
a b c
1 A 1 i
2 A 2 ii
3 B 4 ii
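To get the expected five-row output, deduplicate on all columns rather than on b alone; a base-R sketch with the question's data:

```r
a <- c(rep("A", 3), rep("B", 3), rep("C", 2))
b <- c(1, 1, 2, 4, 1, 1, 2, 2)
c <- c("i", "i", "ii", "ii", "iii", "iii", "iv", "iv")
df <- data.frame(a, b, c)

# duplicated(df) flags rows whose full (a, b, c) combination has
# already been seen, so this keeps the first row of each combination
res <- df[!duplicated(df), ]
res
#   a b c
# 1 A 1 i
# 3 A 2 ii
# 4 B 4 ii
# 5 B 1 iii
# 7 C 2 iv
```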


R merge() rbinds instead of merging

I ran across a behaviour of merge() in R that I can't understand. It seems that it either merges or rbinds data frames depending on whether a column has one or more unique values in it.
a1 <- data.frame (A = c (1, 1))
a2 <- data.frame (A = c (1, 2))
# > merge (a1, a1)
# A
# 1 1
# 2 1
# 3 1
# 4 1
# > merge (a2, a2)
# A
# 1 1
# 2 2
The latter is the result that I would expect, and want, in both cases. I also tried having more than one column, as well as characters instead of numerals, and the results are the same: multiple values result in merging, one unique value results in rbinding.
In the first case each row matches two rows, so there are 2 * 2 = 4 rows in the output; in the second case each row matches exactly one row, so there are 2 rows in the output.
To match on row number use this:
merge(a1, a1, by = 0)
## Row.names A.x A.y
## 1 1 1 1
## 2 2 1 1
or match on row number and only return the left instance:
library(sqldf)
sqldf("select x.* from a1 x left join a1 y on x.rowid = y.rowid")
## A
## 1 1
## 2 1
or match on row number and return both instances:
sqldf("select x.A A1, y.A A2 from a1 x left join a1 y on x.rowid = y.rowid")
## A1 A2
## 1 1 1
## 2 1 1
The behaviour is detailed in the documentation but, basically, merge() will, by default, give you a data.frame with columns taken from both original dfs, merging rows of the two by matching values of all common columns.
df1 <- data.frame(a = 1:3, b = letters[1:3])
df2 <- data.frame(a = 1:5, c = LETTERS[1:5])
df1
a b
1 1 a
2 2 b
3 3 c
df2
a c
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
merge(df1, df2)
a b c
1 1 a A
2 2 b B
3 3 c C
What's happening in your first example is that merge() wants to combine the rows of your two data frames by the A column but because both rows in both dfs are the same, it can't figure out which row to merge with which so it creates all possible combinations.
In your second example, you don't have this problem and so merging is unambiguous. The 1 rows will get merged together as will the 2 rows.
The scenarios are more apparent when you have multiple columns in your dfs:
Case 1:
> df1 <- data.frame(a = c(1, 1), b = letters[1:2])
> df2 <- data.frame(a = c(1, 1), c = LETTERS[1:2])
> df1
a b
1 1 a
2 1 b
> df2
a c
1 1 A
2 1 B
> merge(df1, df2)
a b c
1 1 a A
2 1 a B
3 1 b A
4 1 b B
Case 2:
> df1 <- data.frame(a = c(1, 2), b = letters[1:2])
> df2 <- data.frame(a = c(1, 2), c = LETTERS[1:2])
> df1
a b
1 1 a
2 2 b
> df2
a c
1 1 A
2 2 B
> merge(df1, df2)
a b c
1 1 a A
2 2 b B
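The counting rule the answer describes can be verified directly: for each key value, merge() emits one output row per pair of matching rows, so a value occurring n1 times in one frame and n2 times in the other contributes n1 * n2 rows. A quick check with the question's data:

```r
a1 <- data.frame(A = c(1, 1))
a2 <- data.frame(A = c(1, 2))

# A = 1 occurs twice in each copy of a1, so the self-merge
# yields 2 * 2 = 4 rows -- the "rbind-like" result
nrow(merge(a1, a1))  # 4

# in a2 each value occurs once, so each value contributes 1 * 1 = 1 row
nrow(merge(a2, a2))  # 2
```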

remove duplicates based on 2 columns of data [duplicate]

This question already has an answer here:
remove duplicate values based on 2 columns
(1 answer)
Closed 4 years ago.
I have a set of data:
x <- c(rep("A", 3), rep("B", 3), rep("C",2))
y <- c(1,1,2,4,1,1,2,2)
z <- c(rep("E", 1), rep("F", 4), rep("G",3))
df <-data.frame(x,y,z)
I only want to remove a row if both column x and column z are duplicated.
In this case, after applying the code, rows 2 and 3 should collapse to one row, rows 4 and 5 to one row, and rows 7 and 8 to one row.
How can I do it?
You can use a simple condition to subset your data:
x <- c(rep("A", 3), rep("B", 3), rep("C",2))
y <- c(1,1,2,4,1,1,2,2)
z <- c(rep("A", 1), rep("B", 4), rep("C",3))
df <-data.frame(x,y,z)
df
df[!df$x == df$z,] # the ! excludes all rows for which x == z is TRUE
x y z
2 A 1 B
3 A 2 B
6 B 1 C
Edit: As @RonakShah commented, to exclude duplicated rows, use
df[!duplicated(df[c("x", "z")]),]
or
df[!duplicated(df[c(1, 3)]),]
x y z
1 A 1 A
2 A 1 B
4 B 4 B
6 B 1 C
7 C 2 C
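As an aside, the same two-column deduplication can be written with a single pasted key; a small sketch using the answer's data (the key variable is illustrative only):

```r
x <- c(rep("A", 3), rep("B", 3), rep("C", 2))
y <- c(1, 1, 2, 4, 1, 1, 2, 2)
z <- c(rep("A", 1), rep("B", 4), rep("C", 3))
df <- data.frame(x, y, z)

# build one key from columns x and z, then keep the first row per key
key <- paste(df$x, df$z, sep = "\r")  # separator unlikely to occur in the data
res <- df[!duplicated(key), ]
res
#   x y z
# 1 A 1 A
# 2 A 1 B
# 4 B 4 B
# 6 B 1 C
# 7 C 2 C
```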

Subsetting data.table based on repeated rows

I have a list of data tables stored in an object ddf (a sample is shown below):
[[43]]
V1 V2 V3
1: b c a
2: b c a
3: b c a
4: b c a
5: b b a
6: b c a
7: b c a
[[44]]
V1 V2 V3
1: a c a
2: a c a
3: a c a
4: a c a
5: a c a
[[45]]
V1 V2 V3
1: a c b
2: a c b
3: a c b
4: a c b
5: a c b
6: a c b
7: a c b
8: a c b
9: a c b
.............and so on till [[100]]
I want to subset the list ddf such that the result only consists of the elements which:
have at least 9 rows each
have all of those rows identical
I want to store this subsetted output.
I have written some code for this below:
for(i in 1:100){
  m = (as.numeric(nrow(df[[i]])) >= 9)
  if(m == TRUE & df[[i]][1,] = df[[i]][2,] = df[[i]][3,] =
     df[[i]][4,] = df[[i]][5,] = df[[i]][6,] =
     df[[i]][7,] = df[[i]][8,] = df[[i]][9,]){
    print(df[[i]])
  }
}
Please tell me what's wrong and how I can generalize this to subsetting based on n similar rows.
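For reference, three things break the loop as written: = is assignment, not comparison (==); R has no chained comparisons like a == b == c; and the list is named ddf, not df. A corrected, generalized sketch (the toy ddf below is an illustration, built with plain data.frames so it runs without data.table; the logic is the same for data.tables):

```r
# toy stand-in for the asker's list of 100 tables
ddf <- list(
  data.frame(V1 = rep("a", 9), V2 = rep("c", 9), V3 = rep("b", 9)),
  data.frame(V1 = c("a", "b"), V2 = c("c", "c"), V3 = c("a", "a"))
)

n <- 9  # number of identical rows required; change to generalize
for (i in seq_along(ddf)) {
  x <- ddf[[i]]
  # all rows identical <=> the table collapses to a single unique row
  if (nrow(x) >= n && nrow(unique(x)) == 1) {
    print(x)
  }
}
```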
[Follow-up Question]
Answer obtained from Main question:
> ddf[sapply(ddf, function(x) nrow(x) >= n & nrow(unique(x)) == 1)]
$`61`
V1 V2 V3
1: a c b
2: a c b
3: a c b
4: a c b
5: a c b
6: a c b
7: a c b
$`68`
V1 V2 V3
1: a c a
2: a c a
3: a c a
4: a c a
5: a c a
6: a c a
7: a c a
8: a c a
$`91`
V1 V2 V3
1: b c a
2: b c a
3: b c a
4: b c a
5: b c a
6: b c a
7: b c a
..... till the last data.frame which meet the row matching criteria (of at least 9 similar rows)
There are only 2 types of elements in the list:
Case 1: >70% accuracy
Case 2: <70% accuracy
You will notice that the output shown above in the follow-up question is for $`61`, $`68` and $`91`, but there is no output for the other data.frames which don't meet the matching-row criterion.
I need an output where the elements which don't meet the criterion return "bad output" instead, so that the final list has the same length as the input list.
By placing them side by side using paste I should be able to see each output.
We can loop through the list ('ddf'), subset only the duplicated rows (with duplicated), and order the dataset. If the resulting dataset 'x1' has at least 9 rows, get the first 9 rows (head(x1, 9)); otherwise return 'bad answer':
lapply(ddf, function(x) {
  x1 <- x[duplicated(x) | duplicated(x, fromLast = TRUE)]
  if (nrow(x1) >= 9) {
    x1[order(V1, V2, V3), head(.SD, 9)]
  } else "bad answer"
})
#[[1]]
# V1 V2 V3
#1: b c a
#2: b c a
#3: b c a
#4: b c a
#5: b c a
#6: b c a
#7: b c a
#8: b c a
#9: b c a
#[[2]]
#[1] "bad answer"
#[[3]]
#[1] "bad answer"
data
ddf <- list(data.table(V1 = 'b', V2 = rep(c('c', 'b', 'c'), c(8, 1, 2)), V3 = 'a'),
            data.table(V1 = rep("a", 5), V2 = rep("c", 5), V3 = rep("a", 5)),
            data.table(V1 = c('b', 'a', 'b', 'b'), V2 = c('b', 'a', 'c', 'b'),
                       V3 = c("c", "d", "a", "b")))
When ddf is your list of data.tables, then:
ddf[sapply(ddf, nrow) >= 9 & sapply(ddf, function(x) nrow(unique(x))) == 1]
should give you the desired result.
Where:
sapply(ddf, nrow) >= 9 checks whether the datatables have nine or more rows
sapply(ddf, function(x) nrow(unique(x))) == 1 checks whether all the rows are the same.
Or with one sapply call as @docendodiscimus suggested:
ddf[sapply(ddf, function(x) nrow(x) >= 9 & nrow(unique(x)) == 1)]
Or by using the .N special symbol and the uniqueN function of data.table:
ddf[sapply(ddf, function(x) x[,.N] >= 9 & uniqueN(x) == 1)]
Another option is to use Filter (following the suggestion of @Frank in the comments):
Filter(function(x) nrow(x) >= 9 & uniqueN(x) == 1, ddf)
Two approaches to get the datatable numbers:
1. Using which:
which(sapply(ddf, function(x) nrow(x) >= 9 & nrow(unique(x)) == 1))
2. Assign names to the datatables in the list:
names(ddf) <- paste0('dt', seq_along(ddf))
Now the output will include the data.table number:
$dt4
V1 V2 V3
1 a c b
2 a c b
3 a c b
4 a c b
5 a c b
6 a c b
7 a c b
8 a c b
9 a c b
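Combining the subsetting with the follow-up's "bad output" requirement, one lapply keeps the result the same length as the input. A minimal sketch using plain data.frames so it runs without data.table (the toy ddf below is an illustration, not the asker's data):

```r
# toy stand-in for the asker's list; the logic is identical for data.tables
ddf <- list(
  data.frame(V1 = rep("a", 9), V2 = rep("c", 9), V3 = rep("b", 9)),
  data.frame(V1 = c("a", "b"), V2 = c("c", "c"), V3 = c("a", "a"))
)

# keep elements with at least 9 rows that are all identical;
# everything else becomes "bad output", preserving the list's length
res <- lapply(ddf, function(x) {
  if (nrow(x) >= 9 && nrow(unique(x)) == 1) x else "bad output"
})
length(res)  # same length as ddf
```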

Change the /NA to special values in R

I have a data.table with about 50,000 rows and two columns. There are lots of "/NA" values in it.
Example:
V1 V2
A 1
B 2
A 1
C 3
A NA
B 2
C 3
A /NA
B /NA
A 1
I want to get
V1 V2
A 1
B 2
A 1
C 3
A 1
B 2
C 3
A 1
B 2
A 1
How can I finish it?
Thank you so much, Justin
tf <- tempfile()
writeLines(" V1 V2
A A
B B
A A
C C
A NA
B B
C C
A /NA
B /NA
A A", tf )
x <- read.table(tf, header=T, stringsAsFactors = F)
x$V2 <- ifelse(gsub("[/]","", x$V2) == "NA" | is.na(x$V2), x$V1, x$V2)
R> x
V1 V2
1 A A
2 B B
3 A A
4 C C
5 A A
6 B B
7 C C
8 A A
9 B B
10 A A
edit
A second ifelse() clause (or a switch) is needed for the new question, to map V1 to the numeric V2 values. Note that I've negated the initial clause's test with !:
x$V2 <- ifelse(!(gsub("[/]","", x$V2) == "NA" | is.na(x$V2)), x$V2,
ifelse(x$V1 == "A", 1, ifelse(x$V1 == "B", 2,3)))
You can use within() on a data frame in R to get the same result:
example <- data.frame(V1 = c("A","B","A","C","A","B","C","A","B","A"),
                      V2 = c(1, 2, 1, 3, "NA", 2, 3, "/NA", "/NA", 1),
                      stringsAsFactors = FALSE)
example <- within(example, V2[V1 == "A" & (V2 == "NA" | V2 == "/NA")] <- 1)
example <- within(example, V2[V1 == "B" & (V2 == "NA" | V2 == "/NA")] <- 2)
example <- within(example, V2[V1 == "C" & (V2 == "NA" | V2 == "/NA")] <- 3)
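The three within() calls encode a fixed A/B/C to 1/2/3 mapping; the same idea can be written once with a named lookup vector, which scales better if more levels appear. A sketch (the map vector is illustrative, not part of the original answer):

```r
example <- data.frame(V1 = c("A","B","A","C","A","B","C","A","B","A"),
                      V2 = c(1, 2, 1, 3, "NA", 2, 3, "/NA", "/NA", 1),
                      stringsAsFactors = FALSE)

map <- c(A = "1", B = "2", C = "3")     # lookup table for the fill values
bad <- example$V2 %in% c("NA", "/NA")   # rows needing replacement
example$V2[bad] <- map[example$V1[bad]] # fill from V1 via the lookup
example$V2
# [1] "1" "2" "1" "3" "1" "2" "3" "1" "2" "1"
```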

change data.frame column into rows in R

A <- c(1,6)
B <- c(2,7)
C <- c(3,8)
D <- c(4,9)
E <- c(5,0)
df <- data.frame(A,B,C,D,E)
df
A B C D E
1 1 2 3 4 5
2 6 7 8 9 0
I would like to have this:
df
1 2
A 1 6
B 2 7
C 3 8
D 4 9
E 5 0
If your dataframe is truly in that format, then all of your vectors will be character vectors; you basically have a character matrix, and you could do this:
data.frame(t(df))
It would be better, though, to just define it the way you want it from the get-go:
df <- data.frame(c('A','B','C','D','E'),
c(1, 2, 3, 4, 5),
c(6, 7, 8, 9, 0))
You could also do this
df <- data.frame(LETTERS[1:5], 1:5, c(6:9, 0))
If you wanted to give the columns names, you could do this
df <- data.frame(L = LETTERS[1:5], N1 = 1:5, N2 = c(6:9, 0))
Sometimes, if I use read.DIF on Excel data, the data gets transposed. Is that how you got the original data in? If so, you can call
read.DIF(filename, transpose = TRUE)
to get the data in the correct orientation.
I really recommend the data.table approach over manual steps, because manual steps are error-prone:
A <- c(1,6)
B <- c(2,7)
C <- c(3,8)
D <- c(4,9)
E <- c(5,0)
df <- data.frame(A,B,C,D,E)
df
library('data.table')
dat.m <- melt(as.data.table(df, keep.rownames = "Vars"), id.vars = "Vars") # https://stackoverflow.com/a/44128640/54964
dat.m
Output of df:
  A B C D E
1 1 2 3 4 5
2 6 7 8 9 0
Output of dat.m:
Vars variable value
1: 1 A 1
2: 2 A 6
3: 1 B 2
4: 2 B 7
5: 1 C 3
6: 2 C 8
7: 1 D 4
8: 2 D 9
9: 1 E 5
10: 2 E 0
R: 3.4.0 (backports)
OS: Debian 8.7
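To tie the base-R route together: t() goes through a matrix, so all values are coerced to a common type (harmless here, since every column is numeric), and the old column names become row names. A small sketch with the question's data:

```r
# the question's data frame: two rows, columns A..E
df <- data.frame(A = c(1, 6), B = c(2, 7), C = c(3, 8),
                 D = c(4, 9), E = c(5, 0))

# t() coerces df to a matrix and swaps dimensions; the old column
# names A..E become row names, and the old rows become columns
tdf <- as.data.frame(t(df))
# row names of tdf are A..E; column i holds the values of old row i
```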
