R Get minimum value in dataframe selecting rows on 2 columns [duplicate]

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 2 years ago.
I have a dataframe like the one I've simplified below. I want to first select rows with the same value in column X, then within that selection select rows with the same value in column Y. From each of those groups, I want to take the minimal value of Z. I'm currently using a for-loop, but it seems there must be an easier way. Thanks!
set.seed(123)
data <- data.frame(X = rep(letters[1:3], each = 4), Y = rep(c(1, 1, 2, 2), 3), Z = sample(1:100, 12))
data
X Y Z
1 a 1 76
2 a 1 22
3 a 2 32
4 a 2 23
5 b 1 14
6 b 1 40
7 b 2 39
8 b 2 35
9 c 1 15
10 c 1 13
11 c 2 21
12 c 2 42
Desired outcome:
X Y Z
2 a 1 22
4 a 2 23
5 b 1 14
8 b 2 35
10 c 1 13
11 c 2 21

Here is a data.table solution:
library(data.table)
data = data.table(data)
data[, min(Z), by=c("X", "Y")]
EDIT based on OP's comment:
If there is an NA value in one of the columns we group by, an additional row is created:
data[2, 2] <- NA
data[, min(Z, na.rm = TRUE), by = c("X", "Y")]
X Y V1
1: a 1 31
2: a NA 79
3: a 2 14
4: b 1 31
5: b 2 14
6: c 1 50
7: c 2 25
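If you would rather drop the rows with a missing grouping key than see them as an extra NA group, you can filter them out in `i` before grouping. A minimal sketch, using a small stand-in dataset that mirrors the situation above (and naming the result column, so you get `Z` instead of `V1`):

```r
library(data.table)
# one row has NA in a grouping column, mirroring data[2, 2] <- NA above
dt <- data.table(X = c("a", "a", "a", "b"), Y = c(1, NA, 2, 1), Z = c(76, 22, 32, 14))
# drop rows with an incomplete key before grouping
dt[!is.na(X) & !is.na(Y), .(Z = min(Z)), by = .(X, Y)]
```

The NA row simply never enters the grouping, so no extra group appears.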

library(tidyverse)
data %>%
  group_by(X, Y) %>%
  summarise(Z = min(Z))
Will do the trick! The other answer shows the data.table way; this is the tidyverse way. Both are extremely powerful approaches to data cleaning & manipulation - it could be helpful to familiarize yourself with at least one of them!
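If you want the full minimum rows (matching the desired output, which keeps row identity) rather than a summary table, `slice_min()` is an option. A sketch, assuming dplyr >= 1.0 (on older versions, `filter(Z == min(Z))` does the same job):

```r
library(dplyr)
# small stand-in for the question's data
data <- data.frame(X = rep(c("a", "b"), each = 4),
                   Y = rep(c(1, 1, 2, 2), 2),
                   Z = c(76, 22, 32, 23, 14, 40, 39, 35))
data %>%
  group_by(X, Y) %>%
  slice_min(Z, n = 1, with_ties = FALSE) %>%  # keep one full row per (X, Y) group
  ungroup()
```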

In base R you can use aggregate to get the min of Z grouped by the remaining columns, like:
aggregate(Z~.,data,min)
# X Y Z
#1 a 1 31
#2 b 1 31
#3 c 1 50
#4 a 2 14
#5 b 2 14
#6 c 2 25
In case there is an NA in one of the grouping columns:
data[2, 2] <- NA
To ignore it (the formula interface drops rows with NA by default):
aggregate(Z~.,data,min)
# X Y Z
#1 a 1 31
#2 b 1 31
#3 c 1 50
#4 a 2 14
#5 b 2 14
#6 c 2 25
To show the NA as its own group:
aggregate(data$Z, list(X=data$X, Y=addNA(data$Y)), min)
# X Y x
#1 a 1 31
#2 b 1 31
#3 c 1 50
#4 a 2 14
#5 b 2 14
#6 c 2 25
#7 a <NA> 79

This works in base R too; the one-liner is easier to read split over multiple lines:
do.call(rbind,
        lapply(unlist(lapply(split(data, data$X), function(x) split(x, x$Y)),
                      recursive = FALSE),
               function(y) y[y$Z == min(y$Z), ]))
X Y Z
a.1 a 1 31
a.2 a 2 14
b.1 b 1 31
b.2 b 2 14
c.1 c 1 50
c.2 c 2 25
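The same idea fits in fewer moving parts with `by()`, which handles the two-level split in one call. A sketch (using `which.min()` to keep a single row per group, so ties go to the first occurrence):

```r
# small stand-in for the question's data
data <- data.frame(X = rep(c("a", "b"), each = 4),
                   Y = rep(c(1, 1, 2, 2), 2),
                   Z = c(76, 22, 32, 23, 14, 40, 39, 35))
# by() splits on both X and Y at once; keep the row with the smallest Z per group
res <- do.call(rbind, by(data, data[c("X", "Y")], function(g) g[which.min(g$Z), ]))
res
```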


How to create column, from the cumulative column in r?

df <- data.frame(dat = c("11-03", "12-03", "13-03"),
                 c = c(0, 15, 20, 4, 19, 21, 2, 10, 14),
                 d = rep(c("A", "B", "C"), each = 3))
Suppose c holds the cumulative values. I want to create a column daily that will look like:
dat c d daily
1 11-03 0 A 0
2 12-03 15 A 15
3 13-03 20 A 5
4 11-03 4 B 4
5 12-03 19 B 15
6 13-03 21 B 2
7 11-03 2 C 2
8 12-03 10 C 8
9 13-03 14 C 4
That is, for each value of d, a daily change is computed date-wise (by dat) from the cumulative values in column c.
We can get the diff of 'c' after grouping by 'd'
library(dplyr)
df %>%
  group_by(d) %>%
  mutate(daily = c(first(c), diff(c)))
# A tibble: 9 x 4
# Groups: d [3]
# dat c d daily
# <fct> <dbl> <fct> <dbl>
#1 11-03 0 A 0
#2 12-03 15 A 15
#3 13-03 20 A 5
#4 11-03 4 B 4
#5 12-03 19 B 15
#6 13-03 21 B 2
#7 11-03 2 C 2
#8 12-03 10 C 8
#9 13-03 14 C 4
Or take the difference between 'c' and the lag of 'c', with default = 0 so the first value of each group is kept rather than becoming NA:
df %>%
  group_by(d) %>%
  mutate(daily = c - lag(c, default = 0))
data.table solution:
library(data.table)
df <- as.data.table(df)
df[, daily := c - shift(c, fill = 0), by = d]
shift() is data.table's lag operator, so basically we subtract from c its previous value within each group.
fill = 0 replaces the NA with zero for the first element of each group, which has no previous value (shift(c)).
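The same per-group differencing also works in base R with `ave()`, avoiding any package. A sketch on the same `df`:

```r
# same df as in the question
df <- data.frame(dat = c("11-03", "12-03", "13-03"),
                 c = c(0, 15, 20, 4, 19, 21, 2, 10, 14),
                 d = rep(c("A", "B", "C"), each = 3))
# per group of d: keep the first value, then take successive differences
df$daily <- ave(df$c, df$d, FUN = function(v) c(v[1], diff(v)))
df
```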

What is the way of reordering columns as the same data in each different columns?

I want to find a way to extract only the values that appear in both column A and column B and, for those, sum the corresponding values into a column C. This is just a sample of my huge data.
X=data.frame(A=c(1:5), A1=c(53,15,25,3,5))
X
A A1
1 1 53
2 2 15
3 3 25
4 4 3
5 5 5
Y=data.frame(B=c(5,1,2,11,62,22), B2=c(13,11,23,42,11,88))
B B2
1 5 13
2 1 11
3 2 23
4 11 42
5 62 11
That is, extract the values shared between A and B and show them with the sum of the A1 and B2 data (C = A1 + B2).
My expect result is :
A B C
1 1 1 64
2 2 2 38
3 5 5 18
THANKS!!
You could do something like this:
X <- data.frame(A=c(1:5), A1=c(53,15,25,3,5))
Y <- data.frame(B=c(5,1,2,11,62,22), B2=c(13,11,23,42,11,88))
Z <- merge(X, Y, by.x = 'A', by.y = 'B')
Z$C <- Z$A1 + Z$B2
data.frame(A = Z$A, B = Z$A, C = Z$C)  # A and B are identical after the merge
A B C
1 1 1 64
2 2 2 38
3 5 5 18
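For comparison, the same merge-and-sum can be written as a dplyr pipeline. A sketch, assuming dplyr is available; `inner_join()` keeps only the keys present in both tables, just like `merge()` does by default:

```r
library(dplyr)
# same X and Y as above
X <- data.frame(A = 1:5, A1 = c(53, 15, 25, 3, 5))
Y <- data.frame(B = c(5, 1, 2, 11, 62, 22), B2 = c(13, 11, 23, 42, 11, 88))
X %>%
  inner_join(Y, by = c("A" = "B")) %>%  # keep values present in both A and B
  transmute(A, B = A, C = A1 + B2)
```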

How to delete duplicates but keep most recent data in R

I have the following two data frames:
df1 = data.frame(names=c('a','b','c','c','d'),year=c(11,12,13,14,15), Times=c(1,1,3,5,6))
df2 = data.frame(names=c('a','e','e','c','c','d'),year=c(12,12,13,15,16,16), Times=c(2,2,4,6,7,7))
I would like to know how I could merge the above df but only keeping the most recent Times depending on the year. It should look like this:
Names Year Times
a 12 2
b 12 2
c 16 7
d 16 7
e 13 4
I'm guessing that you do not mean to merge these but rather to combine them by stacking. Your question is ambiguous, since the "duplication" could occur at the dataframe level or at the vector level; your example shows duplication only at the vector level. The best way to describe the problem is that you want the last (or max) Times entry within each group of names values:
> df1
names year Times
1 a 11 1
2 b 12 1
3 c 13 3
4 c 14 5
5 d 15 6
> df2
names year Times
1 a 12 2
2 e 12 2
3 e 13 4
4 c 15 6
5 c 16 7
6 d 16 7
> dfr <- rbind(df1,df2)
> dfr <-dfr[order(dfr$Times),]
> dfr[!duplicated(dfr, fromLast=TRUE) , ]
names year Times
1 a 11 1
2 b 12 1
6 a 12 2
7 e 12 2
3 c 13 3
8 e 13 4
4 c 14 5
5 d 15 6
9 c 15 6
10 c 16 7
11 d 16 7
> dfr[!duplicated(dfr$names, fromLast=TRUE) , ]
names year Times
2 b 12 1
6 a 12 2
8 e 13 4
10 c 16 7
11 d 16 7
This uses base R functions; there are also newer packages (such as plyr) that many feel make the split-apply-combine process more intuitive.
df <- rbind(df1, df2)
do.call(rbind, lapply(split(df, df$names), function(x) x[which.max(x$year), ]))
## names year Times
## a a 12 2
## b b 12 1
## c c 16 7
## d d 16 7
## e e 13 4
We could also use aggregate with the formula interface, which keeps the column names:
df <- rbind(df1, df2)
aggregate(cbind(year, Times) ~ names, df, max)
# names year Times
# 1 a 12 2
# 2 b 12 1
# 3 c 16 7
# 4 d 16 7
# 5 e 13 4
In case you wanted to see a data.table solution,
# load library
library(data.table)
# bind by row and convert to data.table (by reference)
df <- setDT(rbind(df1, df2))
# get the result
df[order(names, year), .SD[.N], by=.(names)]
The output is as follows:
names year Times
1: a 12 2
2: b 12 1
3: c 16 7
4: d 16 7
5: e 13 4
The final line orders the row-bound data by names and year, and then chooses the last observation (.SD[.N]) for each name.
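For completeness, a dplyr version of the same idea; a sketch assuming dplyr >= 1.0 for `slice_max()`:

```r
library(dplyr)
# same df1 and df2 as in the question
df1 <- data.frame(names = c('a','b','c','c','d'), year = c(11, 12, 13, 14, 15), Times = c(1, 1, 3, 5, 6))
df2 <- data.frame(names = c('a','e','e','c','c','d'), year = c(12, 12, 13, 15, 16, 16), Times = c(2, 2, 4, 6, 7, 7))
# stack, then keep the row with the largest year within each name
bind_rows(df1, df2) %>%
  group_by(names) %>%
  slice_max(year, n = 1, with_ties = FALSE) %>%
  ungroup()
```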

Multiplication of different subsets with different data in R

I have a large dataset, which I split up into subsets. For each subset, I have to do the same calculations but with different numbers. Example:
Main Table
x a b c d
A 1 2 4 5
A 4 5 1 7
A 3 5 6 2
B 4 5 2 9
B 3 5 2 8
C 4 2 5 2
C 1 9 6 9
C 1 2 3 4
C 6 3 6 2
Additional Table for A
a b c d
A 5 1 6 1
Additional Table for B
a b c d
B 1 5 2 6
Additional Table for C
a b c d
C 8 2 4 1
I need to multiply all rows A in the Main Table with the values from the Additional Table for A, all rows B in the Main Table with the values from B, and all rows C in the Main Table with the values from C. It is completely fine to merge the additional tables into a combined one, if this makes the solution easier.
I thought about a for-loop but I am not able to put the different multiplicators (from the Additional Tables) into the code. Since there is a large number of subgroups, coding each multiplication manually should be avoided. How do I do this multiplications?
If we start with the additional tables combined into addDf and the main table as df:
addDf
x a b c d
A A 5 1 6 1
B B 1 5 2 6
C C 8 2 4 1
We can use a merge followed by element-wise multiplication, as
df[-1] <- merge(addDf, data.frame(x = df[1]), by = "x")[-1] * df[order(df[1]), -1]
df
x a b c d
1 A 5 2 24 5
2 A 20 5 6 7
3 A 15 5 36 2
4 B 4 25 4 54
5 B 3 25 4 48
6 C 32 4 20 2
7 C 8 18 24 9
8 C 8 4 12 4
9 C 48 6 24 2
Note: Borrowed a little syntax sugar from #akrun as df[-1] assignment.
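Another base option, assuming the additional tables are stacked into a single lookup table `addDf` as above: a `match()` lookup replicates each group's multiplier row to align with `df`, so no reordering of `df` is needed. Equal-sized data frames multiply element-wise in R. A sketch:

```r
# same data as above: main table df and combined lookup table addDf
df <- data.frame(x = rep(c("A", "B", "C"), c(3, 2, 4)),
                 a = c(1, 4, 3, 4, 3, 4, 1, 1, 6),
                 b = c(2, 5, 5, 5, 5, 2, 9, 2, 3),
                 c = c(4, 1, 6, 2, 2, 5, 6, 3, 6),
                 d = c(5, 7, 2, 9, 8, 2, 9, 4, 2))
addDf <- data.frame(x = c("A", "B", "C"),
                    a = c(5, 1, 8), b = c(1, 5, 2),
                    c = c(6, 2, 4), d = c(1, 6, 1))

# look up each row's multiplier row by group, then multiply element-wise
mult <- addDf[match(df$x, addDf$x), -1]
df[-1] <- df[-1] * mult
df
```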
We can use Map after splitting the main data 'df' (assuming that all of the datasets are data.frames):
df[-1] <- unsplit(Map(function(x, y) x * y[col(x)],
                      split(df[-1], df$x),
                      list(unlist(dfA), unlist(dfB), unlist(dfC))), df$x)
df
# x a b c d
#1 A 5 2 24 5
#2 A 20 5 6 7
#3 A 15 5 36 2
#4 B 4 25 4 54
#5 B 3 25 4 48
#6 C 32 4 20 2
#7 C 8 18 24 9
#8 C 8 4 12 4
#9 C 48 6 24 2
Or we can use a faster option with data.table
library(data.table)
setnames(setDT(do.call(rbind, list(dfA, dfB, dfC)), keep.rownames=TRUE)[df,
.(a= a*i.a, b= b*i.b, c = c*i.c, d= d*i.d), on = c('rn' = 'x'), by = .EACHI], 1, 'x')[]
# x a b c d
#1: A 5 2 24 5
#2: A 20 5 6 7
#3: A 15 5 36 2
#4: B 4 25 4 54
#5: B 3 25 4 48
#6: C 32 4 20 2
#7: C 8 18 24 9
#8: C 8 4 12 4
#9: C 48 6 24 2
The above would be difficult if there are many columns; in that case, we could use mget to retrieve the columns and do the * on the i. columns with Map
setDT(do.call(rbind, list(dfA, dfB, dfC)), keep.rownames=TRUE)[df,
Map(`*`, mget(names(df)[-1]), mget(paste0("i.", names(df)[-1]))) ,
on = c('rn' = 'x'), by = .EACHI]

Removing duplicate rows on the basis of specific columns

How can I remove duplicate rows on the basis of specific columns while keeping the full dataset? I tried the approaches from link1 and link2.
What I want to do is check for ambiguity on the basis of columns 3 to 6: if their values are the same, the processed dataset should drop the duplicated rows, as shown in the example.
I used this code, but it only gave me half the result, since it returns just columns 3 to 6:
Data <- unique(Data[, 3:6])
Let's suppose my dataset is like this:
A B C D E F G H I J K L M
1 2 2 1 5 4 12 A 3 5 6 2 1
1 2 2 1 5 4 12 A 2 35 36 22 21
1 22 32 31 5 34 12 A 3 5 6 2 1
What I want in my output is:
A B C D E F G H I J K L M
1 2 2 1 5 4 12 A 3 5 6 2 1
1 22 32 31 5 34 12 A 3 5 6 2 1
Another option is unique from data.table, which has a by option. We convert the 'data.frame' to a 'data.table' (setDT(df1)), then use unique and specify the columns within by:
library(data.table)
unique(setDT(df1), by= names(df1)[3:6])
# A B C D E F G H I J K L M
#1: 1 2 2 1 5 4 12 A 3 5 6 2 1
#2: 1 22 32 31 5 34 12 A 3 5 6 2 1
unique returns a data.table with duplicated rows removed.
Assuming that your data is stored as a dataframe, you could try:
Data <- Data[!duplicated(Data[,3:6]),]
#> Data
# A B C D E F G H I J K L M
#1 1 2 2 1 5 4 12 A 3 5 6 2 1
#3 1 22 32 31 5 34 12 A 3 5 6 2 1
The function duplicated() returns a logical vector containing in this case information for each row about whether the combination of the entries in column 3 to 6 reappears elsewhere in the dataset. The negation ! of this logical vector is used to select the rows from your dataset, resulting in a dataset with unique combinations of the entries in column 3 to 6.
Thanks to @thelatemail for pointing out a mistake in my previous post.
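The dplyr counterpart is `distinct()` with `.keep_all = TRUE`, which keeps the first row for each combination of the chosen columns. A sketch; selecting by position with `across()` works on dplyr 1.0 (newer versions prefer `pick(3:6)`):

```r
library(dplyr)
# small stand-in dataset: rows 1 and 2 agree on columns 3 to 6
Data <- data.frame(A = c(1, 1, 1), B = c(2, 2, 22), C = c(2, 2, 32),
                   D = c(1, 1, 31), E = c(5, 5, 5), F = c(4, 4, 34))
# keep the first row for each combination of columns 3 to 6, retaining all columns
Data %>% distinct(across(3:6), .keep_all = TRUE)
```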
