Removing duplicate rows on the basis of specific columns - r

How can I remove duplicate rows on the basis of specific columns while keeping the rest of the dataset intact? I tried the approaches in link1 and link2.
I want to detect duplicates on the basis of columns 3 to 6. If their values are the same, the processed dataset should drop the repeated rows, as shown in the example below.
I used this code, but it gave me only half the result:
Data <- unique(Data[, 3:6])
Let's suppose my dataset looks like this:
A B C D E F G H I J K L M
1 2 2 1 5 4 12 A 3 5 6 2 1
1 2 2 1 5 4 12 A 2 35 36 22 21
1 22 32 31 5 34 12 A 3 5 6 2 1
What I want in my output is:
A B C D E F G H I J K L M
1 2 2 1 5 4 12 A 3 5 6 2 1
1 22 32 31 5 34 12 A 3 5 6 2 1

Another option is unique from data.table, which has a by argument. We convert the data.frame to a data.table with setDT(df1), then call unique and specify the columns in by:
library(data.table)
unique(setDT(df1), by= names(df1)[3:6])
# A B C D E F G H I J K L M
#1: 1 2 2 1 5 4 12 A 3 5 6 2 1
#2: 1 22 32 31 5 34 12 A 3 5 6 2 1
unique returns a data.table with duplicated rows removed.

Assuming that your data is stored as a dataframe, you could try:
Data <- Data[!duplicated(Data[,3:6]),]
#> Data
# A B C D E F G H I J K L M
#1 1 2 2 1 5 4 12 A 3 5 6 2 1
#3 1 22 32 31 5 34 12 A 3 5 6 2 1
The function duplicated() returns a logical vector that indicates, for each row, whether the combination of entries in columns 3 to 6 has already appeared in an earlier row. The negation ! of this vector selects the remaining rows, so the result keeps the first row for each unique combination of the entries in columns 3 to 6, with all other columns intact.
Thanks to @thelatemail for pointing out a mistake in my previous post.
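To make the difference between the two calls concrete, here is a minimal sketch that rebuilds the example data from the question with read.table and compares unique() on the subset with the duplicated() mask. The object names keys_only, dup, and result are chosen here for illustration only.

```r
# Rebuild the three example rows from the question.
Data <- read.table(header = TRUE, text = "
A B C D E F G H I J K L M
1 2 2 1 5 4 12 A 3 5 6 2 1
1 2 2 1 5 4 12 A 2 35 36 22 21
1 22 32 31 5 34 12 A 3 5 6 2 1
")

# unique() on the subset returns only the four key columns,
# which is why the original attempt looked like half a result.
keys_only <- unique(Data[, 3:6])
dim(keys_only)   # 2 rows, 4 columns

# duplicated() flags rows whose key repeats an earlier row.
dup <- duplicated(Data[, 3:6])
dup              # FALSE TRUE FALSE

# Indexing the full data frame with the negated mask keeps all 13 columns.
result <- Data[!dup, ]
dim(result)      # 2 rows, 13 columns
```

The second row is dropped because its values in columns 3 to 6 repeat those of the first row, even though its other columns differ.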

Finding the rank of each element of row names of a matrix in R

I have a numeric matrix of 2000 rows and 6900 columns. I am using R and I want to find the rank of each element of row names in each column.
For example, suppose element "A" is the biggest value in the first column, the third biggest in the second column, and the fifth biggest in the third column.
I then want to replace the values of "A" with 1, 3, 5 in the first three columns, and do the same for all row names and all columns. Basically, I want to find the rank of each value within its column.
Is there a way to do it? I tried the rank, which, sort, and order functions but could not get it to work.
I am using (and prefer) R but Python is also okay.
Thank you in advance
How about this.
Example of your data based on your description:
set.seed(1)
m <- matrix(sample(1:25), 5, dimnames = list(LETTERS[1:5], 1:5))
m
#> 1 2 3 4 5
#> A 25 11 16 9 8
#> B 4 14 10 15 13
#> C 7 18 6 12 21
#> D 1 22 19 17 3
#> E 2 5 23 20 24
Solution (note that rank() assigns 1 to the smallest value in each column):
apply(m, 2, rank)
#> 1 2 3 4 5
#> A 5 2 3 1 2
#> B 3 3 2 3 3
#> C 4 4 1 2 4
#> D 1 5 4 4 1
#> E 2 1 5 5 5
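The question asks for rank 1 to go to the largest value, whereas rank() gives 1 to the smallest. A minimal sketch of the descending variant, assuming the matrix is numeric (desc_rank is a name chosen here for illustration):

```r
set.seed(1)
m <- matrix(sample(1:25), 5, dimnames = list(LETTERS[1:5], 1:5))

# Negating each column reverses the order, so the largest value gets rank 1.
desc_rank <- apply(m, 2, function(col) rank(-col))

# Row and column names are carried over from m.
desc_rank
```

If the real data contains ties, the ties.method argument of rank() (for example "min" or "first") controls how they are resolved; the default is "average".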

R Get minimum value in dataframe selecting rows on 2 columns [duplicate]

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 2 years ago.
I have a dataframe like the one I've simplified below. I want to first select rows with the same value in column X, then within that selection select rows with the same value in column Y. From each of those groups, I want to take the minimum value of Z. I'm currently using a for loop, but it seems there must be an easier way. Thanks!
set.seed(123)
data<-data.frame(X=rep(letters[1:3], each=8),Y=rep(c(1,2)),Z=sample(1:100, 12))
data
X Y Z
1 a 1 76
2 a 1 22
3 a 2 32
4 a 2 23
5 b 1 14
6 b 1 40
7 b 2 39
8 b 2 35
9 c 1 15
10 c 1 13
11 c 2 21
12 c 2 42
Desired outcome:
X Y Z
2 a 1 22
4 a 2 23
5 b 1 14
8 b 2 35
10 c 1 13
11 c 2 21
Here is a data.table solution:
library(data.table)
data = data.table(data)
data[, min(Z), by=c("X", "Y")]
EDIT based on OP's comment:
If there is an NA value in one of the columns we group by, an additional group (and therefore an additional row) is created:
data[2,2] <- NA
data[, min(Z, na.rm = TRUE), by=c("X", "Y")]
X Y V1
1: a 1 31
2: a NA 79
3: a 2 14
4: b 1 31
5: b 2 14
6: c 1 50
7: c 2 25
library(tidyverse)
data %>%
group_by(X, Y) %>%
summarise(Z = min(Z))
This will do the trick. The other answer so far uses data.table; this one uses the tidyverse. Both are powerful ways to approach data cleaning and manipulation, so it could be helpful to familiarize yourself with one of them.
In base R you can use aggregate to get the minimum of Z grouped by the remaining columns:
aggregate(Z~.,data,min)
# X Y Z
#1 a 1 31
#2 b 1 31
#3 c 1 50
#4 a 2 14
#5 b 2 14
#6 c 2 25
In case there is an NA in the groups:
data[2,2] <-NA
Ignore it (the formula interface drops rows containing NA by default):
aggregate(Z~.,data,min)
# X Y Z
#1 a 1 31
#2 b 1 31
#3 c 1 50
#4 a 2 14
#5 b 2 14
#6 c 2 25
Show it:
aggregate(data$Z, list(X=data$X, Y=addNA(data$Y)), min)
# X Y x
#1 a 1 31
#2 b 1 31
#3 c 1 50
#4 a 2 14
#5 b 2 14
#6 c 2 25
#7 a <NA> 79
This works in base R, although the code reads better when split across multiple lines:
do.call(rbind,
  lapply(
    unlist(lapply(split(data, data$X), function(x) split(x, x$Y)), recursive = FALSE),
    function(y) y[y$Z == min(y$Z), ]
  )
)
X Y Z
a.1 a 1 31
a.2 a 2 14
b.1 b 1 31
b.2 b 2 14
c.1 c 1 50
c.2 c 2 25
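The nested split/lapply approach keeps whole rows rather than just the minimum. A shorter base R sketch of the same idea uses ave() to compute each group's minimum aligned with the original rows. The data below is typed in from the 12-row table printed in the question (not generated with sample()), and group_min and min_rows are names chosen here for illustration.

```r
# The 12 rows as printed in the question.
data <- data.frame(
  X = rep(letters[1:3], each = 4),
  Y = rep(c(1, 2), each = 2, times = 3),
  Z = c(76, 22, 32, 23, 14, 40, 39, 35, 15, 13, 21, 42)
)

# ave() returns, for every row, the minimum of Z within its (X, Y) group.
group_min <- ave(data$Z, data$X, data$Y, FUN = min)

# Keep the rows whose Z equals their group minimum.
min_rows <- data[data$Z == group_min, ]
min_rows
```

This keeps the original row names (2, 4, 5, 8, 10, 11), matching the desired outcome in the question. If a group has several rows tied for the minimum, all of them are kept.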

How do I properly define new columns based on multiple conditions in R [duplicate]

This question already has answers here:
Define and apply custom bins on a dataframe
(4 answers)
Closed 3 years ago.
I'm having the following problem with my R code (which I have worked around with 12 nested if-else statements, which is far from desirable). As I cannot share the full code or the data, I have constructed a similar problem. Suppose I have the following column in my dataset, the frequency:
> test_df
ID Frequency
1 1 1
2 2 56
3 3 34
4 4 22
5 5 9
6 6 8
7 7 50
8 8 7
Now, I want to create a new column based on a table that categorizes the Frequency, namely:
htbl
freq_interval category
1 6 A
2 18 B
3 20 C
4 30 D
5 40 E
Now, I want to mutate a new column based on this table, in the following way: if the frequency is less than 6, give the new column the value "A". If the frequency is less than 18, but more than 6, give the new column the value "B". If the frequency is less than 20 but more than 18, give it the value "C" and so on. So, my desired new test_df will be:
ID Frequency mutated_column
1 1 1 A
2 2 56 <NA>
3 3 34 E
4 4 22 D
5 5 9 B
6 6 8 B
7 7 50 <NA>
8 8 7 B
How can I do this cleanly?
Thanks in advance
We can use findInterval or cut here
test_df$mutated_column <- htbl$category[findInterval(test_df$Frequency,
htbl$freq_interval) + 1]
test_df
# ID Frequency mutated_column
#1 1 1 A
#2 2 56 <NA>
#3 3 34 E
#4 4 22 D
#5 5 9 B
#6 6 8 B
#7 7 50 <NA>
#8 8 7 B
With cut that would be
cut(test_df$Frequency, breaks = c(-Inf, htbl$freq_interval),labels = htbl$category)
#[1] A <NA> E D B B <NA> B
#Levels: A B C D E
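The two calls agree on the example data, but they treat values that fall exactly on a break differently: findInterval() uses intervals closed on the left, while cut() defaults to intervals closed on the right. A small sketch, using made-up test values in x that sit on the boundaries:

```r
htbl <- data.frame(freq_interval = c(6, 18, 20, 30, 40),
                   category = c("A", "B", "C", "D", "E"),
                   stringsAsFactors = FALSE)

# Hypothetical test values chosen to hit the break points.
x <- c(5, 6, 18, 40, 41)
breaks <- c(-Inf, htbl$freq_interval)

# findInterval(): 6 falls in [6, 18), so it maps to "B".
via_findInterval <- htbl$category[findInterval(x, htbl$freq_interval) + 1]

# cut() with the default right = TRUE: 6 falls in (-Inf, 6], so it maps to "A".
via_cut_default <- as.character(cut(x, breaks = breaks, labels = htbl$category))

# cut() with right = FALSE reproduces the findInterval() behaviour.
via_cut_left <- as.character(cut(x, breaks = breaks, labels = htbl$category,
                                 right = FALSE))

via_findInterval  # "A" "B" "C" NA  NA
via_cut_default   # "A" "A" "B" "E" NA
via_cut_left      # "A" "B" "C" NA  NA
```

Which convention is right depends on whether "less than 6" in the question is meant to include 6 itself.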

How to delete duplicates but keep most recent data in R

I have the following two data frames:
df1 = data.frame(names=c('a','b','c','c','d'),year=c(11,12,13,14,15), Times=c(1,1,3,5,6))
df2 = data.frame(names=c('a','e','e','c','c','d'),year=c(12,12,13,15,16,16), Times=c(2,2,4,6,7,7))
I would like to know how I could merge the two data frames above while keeping only the most recent Times for each name, according to the year. It should look like this:
Names Year Times
a 12 2
b 12 1
c 16 7
d 16 7
e 13 4
I'm guessing that you do not mean to merge these but rather to combine them by stacking. Your question is ambiguous, since the "duplication" could occur at the dataframe level or at the vector level. Your example does not display any duplication at the dataframe level but does at the vector level. The best way to describe the problem is that you want the last (or max) Times entry within each group of names values:
> df1
names year Times
1 a 11 1
2 b 12 1
3 c 13 3
4 c 14 5
5 d 15 6
> df2
names year Times
1 a 12 2
2 e 12 2
3 e 13 4
4 c 15 6
5 c 16 7
6 d 16 7
> dfr <- rbind(df1,df2)
> dfr <-dfr[order(dfr$Times),]
> dfr[!duplicated(dfr, fromLast=TRUE) , ]
names year Times
1 a 11 1
2 b 12 1
6 a 12 2
7 e 12 2
3 c 13 3
8 e 13 4
4 c 14 5
5 d 15 6
9 c 15 6
10 c 16 7
11 d 16 7
> dfr[!duplicated(dfr$names, fromLast=TRUE) , ]
names year Times
2 b 12 1
6 a 12 2
8 e 13 4
10 c 16 7
11 d 16 7
This uses base R functions; there are also newer packages (such as plyr) that many feel make the split-apply-combine process more intuitive.
df <- rbind(df1, df2)
do.call(rbind, lapply(split(df, df$names), function(x) x[which.max(x$year), ]))
## names year Times
## a a 12 2
## b b 12 1
## c c 16 7
## d d 16 7
## e e 13 4
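We could also use aggregate. Note that max is applied to year and Times independently, so this matches the most recent row only when Times never decreases with year, as is the case here:
df <- rbind(df1,df2)
aggregate(cbind(year, Times) ~ names, df, max)
#   names year Times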
# 1 a 12 2
# 2 b 12 1
# 3 c 16 7
# 4 d 16 7
# 5 e 13 4
In case you wanted to see a data.table solution,
# load library
library(data.table)
# bind by row and convert to data.table (by reference)
df <- setDT(rbind(df1, df2))
# get the result
df[order(names, year), .SD[.N], by=.(names)]
The output is as follows:
names year Times
1: a 12 2
2: b 12 1
3: c 16 7
4: d 16 7
5: e 13 4
The final line orders the row-bound data by names and year, and then chooses the last observation (.SD[.N]) for each name.
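The same "sort, then keep one row per name" idea can be written in base R by sorting on year in descending order within each name and keeping the first occurrence. This is a sketch under the assumption that "most recent" means the largest year; stacked and latest are names chosen here for illustration.

```r
df1 <- data.frame(names = c('a', 'b', 'c', 'c', 'd'),
                  year  = c(11, 12, 13, 14, 15),
                  Times = c(1, 1, 3, 5, 6))
df2 <- data.frame(names = c('a', 'e', 'e', 'c', 'c', 'd'),
                  year  = c(12, 12, 13, 15, 16, 16),
                  Times = c(2, 2, 4, 6, 7, 7))

# Stack the two data frames, then sort by name and by year (newest first).
stacked <- rbind(df1, df2)
stacked <- stacked[order(stacked$names, -stacked$year), ]

# duplicated() is FALSE for the first row of each name, i.e. the newest year.
latest <- stacked[!duplicated(stacked$names), ]
latest
```

Unlike taking the maximum of each column separately, this returns the Times value that belongs to the most recent year, even if Times does not increase over time.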

Multiplication of different subsets with different data in R

I have a large dataset, which I split up into subsets. For each subset, I have to do the same calculations but with different numbers. Example:
Main Table
x a b c d
A 1 2 4 5
A 4 5 1 7
A 3 5 6 2
B 4 5 2 9
B 3 5 2 8
C 4 2 5 2
C 1 9 6 9
C 1 2 3 4
C 6 3 6 2
Additional Table for A
a b c d
A 5 1 6 1
Additional Table for B
a b c d
B 1 5 2 6
Additional Table for C
a b c d
C 8 2 4 1
I need to multiply all rows A in the Main Table by the values from Additional Table for A, all rows B in the Main Table by the values from Additional Table for B, and all rows C in the Main Table by the values from Additional Table for C. It is completely fine to merge the additional tables into a combined one if this makes the solution easier.
I thought about a for loop, but I am not able to get the different multipliers (from the Additional Tables) into the code. Since there is a large number of subgroups, coding each multiplication manually should be avoided. How do I do these multiplications?
If we start with the additional tables combined as addDf and the main table as df:
addDf
x a b c d
A A 5 1 6 1
B B 1 5 2 6
C C 8 2 4 1
We can use a merge followed by element-wise multiplication:
df[-1] <- merge(addDf, data.frame(x = df[1]), by = "x")[-1] * df[order(df[1]), -1]
df
x a b c d
1 A 5 2 24 5
2 A 20 5 6 7
3 A 15 5 36 2
4 B 4 25 4 54
5 B 3 25 4 48
6 C 32 4 20 2
7 C 8 18 24 9
8 C 8 4 12 4
9 C 48 6 24 2
Note: Borrowed a little syntactic sugar from @akrun, namely the df[-1] assignment.
We can use Map after splitting the main data 'df' (assuming that all of the datasets are data.frames).
df[-1] <- unsplit(Map(function(x,y) x*y[col(x)],
split(df[-1], df$x),
list(unlist(dfA), unlist(dfB), unlist(dfC))), df$x)
df
# x a b c d
#1 A 5 2 24 5
#2 A 20 5 6 7
#3 A 15 5 36 2
#4 B 4 25 4 54
#5 B 3 25 4 48
#6 C 32 4 20 2
#7 C 8 18 24 9
#8 C 8 4 12 4
#9 C 48 6 24 2
Or we can use a faster option with data.table
library(data.table)
setnames(setDT(do.call(rbind, list(dfA, dfB, dfC)), keep.rownames=TRUE)[df,
.(a= a*i.a, b= b*i.b, c = c*i.c, d= d*i.d), on = c('rn' = 'x'), by = .EACHI], 1, 'x')[]
# x a b c d
#1: A 5 2 24 5
#2: A 20 5 6 7
#3: A 15 5 36 2
#4: B 4 25 4 54
#5: B 3 25 4 48
#6: C 32 4 20 2
#7: C 8 18 24 9
#8: C 8 4 12 4
#9: C 48 6 24 2
The above would be difficult if there are many columns. In that case, we could use mget to retrieve the columns and apply * to them and their i. counterparts with Map:
setDT(do.call(rbind, list(dfA, dfB, dfC)), keep.rownames=TRUE)[df,
Map(`*`, mget(names(df)[-1]), mget(paste0("i.", names(df)[-1]))) ,
on = c('rn' = 'x'), by = .EACHI]
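For comparison, here is a base R sketch that avoids both merge and split by using match() to look up the multiplier row for each row of the main table. It assumes the additional tables have already been combined into addDf with a key column x, as in the first answer; value_cols, idx, and result are names chosen here for illustration.

```r
# Main table and combined additional table, typed in from the question.
df <- data.frame(x = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
                 a = c(1, 4, 3, 4, 3, 4, 1, 1, 6),
                 b = c(2, 5, 5, 5, 5, 2, 9, 2, 3),
                 c = c(4, 1, 6, 2, 2, 5, 6, 3, 6),
                 d = c(5, 7, 2, 9, 8, 2, 9, 4, 2))
addDf <- data.frame(x = c("A", "B", "C"),
                    a = c(5, 1, 8), b = c(1, 5, 2),
                    c = c(6, 2, 4), d = c(1, 6, 1))

value_cols <- c("a", "b", "c", "d")

# For each row of df, find the row of addDf with the same group label.
idx <- match(df$x, addDf$x)

# Expand the multipliers to the shape of df and multiply element-wise.
result <- df
result[value_cols] <- df[value_cols] * addDf[idx, value_cols]
result
```

Because match() aligns the multipliers with the existing row order, the main table does not need to be sorted by group first.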
