Remove duplicate id based on certain criteria [duplicate] - R

id <- c(1,1,2,3,4,4,5,6,7,7,7,8,9)
age <- c(10,10.6,11,11.3,10.9,11.4,10.7,11,10.5,11.1,12.3,10.3,10.7)
ageto11 <- abs(age-11)
df <- as.data.frame(cbind(id,age,ageto11))
df
id age ageto11
1 1 10.0 1.0
2 1 10.6 0.4
3 2 11.0 0.0
4 3 11.3 0.3
5 4 10.9 0.1
6 4 11.4 0.4
7 5 10.7 0.3
8 6 11.0 0.0
9 7 10.5 0.5
10 7 11.1 0.1
11 7 12.3 1.3
12 8 10.3 0.7
13 9 10.7 0.3
I am trying to remove the duplicated ids in the above data frame, based on the criterion of keeping the row with the smallest distance to age 11 (i.e. the smallest value of ageto11).
For example, when id=1, I would like to remove the first row, in which ageto11 is larger.
When id=7, I would like to keep the 10th row, in which ageto11 is the smallest.
The desired result should be like
id age ageto11
2 1 10.6 0.4
3 2 11.0 0.0
4 3 11.3 0.3
5 4 10.9 0.1
7 5 10.7 0.3
8 6 11.0 0.0
10 7 11.1 0.1
12 8 10.3 0.7
13 9 10.7 0.3

We convert the 'data.frame' to 'data.table' (setDT(df)); grouped by 'id', we get the difference of 'age' from 11, find the index of the minimum absolute value (which.min(abs(..))), and subset the dataset (.SD).
library(data.table)
setDT(df)[,.SD[which.min(abs(age-11))] , id]
# id age ageto11
#1: 1 10.6 0.4
#2: 2 11.0 0.0
#3: 3 11.3 0.3
#4: 4 10.9 0.1
#5: 5 10.7 0.3
#6: 6 11.0 0.0
#7: 7 11.1 0.1
#8: 8 10.3 0.7
#9: 9 10.7 0.3
EDIT: Just notified by @Pascal that the distance is already calculated in 'ageto11'. In that case:
setDT(df)[, .SD[which.min(ageto11)], id]
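A dplyr sketch of the same idea (assuming dplyr 1.0+ for slice_min()) reads much the same way: group by id, then keep the row with the smallest ageto11 in each group.

```r
library(dplyr)

# Keep, per id, the single row whose ageto11 is smallest
df %>%
  group_by(id) %>%
  slice_min(ageto11, n = 1, with_ties = FALSE) %>%
  ungroup()
```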

Related

Create a new row from the average of specific rows from all columns [duplicate]

Let's say I have a dataset.
w=c(5,6,7,8)
x=c(1,2,3,4)
y=c(1,2,3,5)
length(y)=4
z=data.frame(w,x,y)
This will return
w x y
1 5 1 1
2 6 2 2
3 7 3 3
4 8 4 5
I would like to have a 5th row that averages only row 2 and 3.
w x y
1 5 1 1
2 6 2 2
3 7 3 3
4 8 4 5
5 6.5 2.5 2.5
How would I approach this?
There are a lot of examples with rowMeans, but I'm looking to average all columns, and from only specific rows.
You can use colMeans as:
rows <- c(2, 3)
rbind(z, colMeans(z[rows,]))
# w x y
#1 5.0 1.0 1.0
#2 6.0 2.0 2.0
#3 7.0 3.0 3.0
#4 8.0 4.0 5.0
#5 6.5 2.5 2.5
Does this work?
library(dplyr)
z %>% bind_rows(sapply(z[2:3,], mean))
w x y
1 5.0 1.0 1.0
2 6.0 2.0 2.0
3 7.0 3.0 3.0
4 8.0 4.0 5.0
5 6.5 2.5 2.5
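The same idea also works in data.table; this is just a sketch of the equivalent, appending the column means of the chosen rows as a one-row list:

```r
library(data.table)

rows <- c(2, 3)
zt <- as.data.table(z)
# rbind() on a data.table accepts a list as an extra row
rbind(zt, as.list(colMeans(zt[rows])))
```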

Filter a group of a data.frame based on multiple conditions

I am looking for an elegant way to filter the values of a specific group of big data.frame based on multiple conditions.
My data frame looks like this.
data <- data.frame(group = c("A","B","C","A","B","C","A","B","C"),
                   time  = c(rep(1,3), rep(2,3), rep(3,3)),
                   value = c(0.2,1,1,0.1,10,20,10,20,30))
group time value
1 A 1 0.2
2 B 1 1.0
3 C 1 1.0
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
For time point 1 only, I would like to keep just the values that are bigger than 0.1 but smaller than 1, and drop the other time-1 rows.
I want my data.frame to look like this.
group time value
1 A 1 0.2
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
Any help is highly appreciated.
With dplyr you can do
library(dplyr)
data %>% filter(!(time == 1 & (value <= 0.1 | value >= 1)))
# group time value
# 1 A 1 0.2
# 2 A 2 0.1
# 3 B 2 10.0
# 4 C 2 20.0
# 5 A 3 10.0
# 6 B 3 20.0
# 7 C 3 30.0
Or, if you have too much free time and decided to avoid dplyr:
ind <- with(data, time == 1 & value > 0.1 & value < 1)
ind <- ifelse(data$time == 1 & (data$value > 0.1 & data$value < 1), TRUE, FALSE)
# the two lines above are equivalent
data <- data[!(data$time == 1 & !ind), ]
group time value
1 A 1 0.2
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
Another simple option would be to use subset twice and then append the results row-wise.
rbind(
subset(data, time == 1 & value > 0.1 & value < 1),
subset(data, time != 1)
)
# group time value
# 1 A 1 0.2
# 4 A 2 0.1
# 5 B 2 10.0
# 6 C 2 20.0
# 7 A 3 10.0
# 8 B 3 20.0
# 9 C 3 30.0
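For completeness, the two subset() calls can also be collapsed into one, since a row survives exactly when it is either outside time 1 or inside the (0.1, 1) range:

```r
# Single-subset version of the same filter
subset(data, time != 1 | (value > 0.1 & value < 1))
```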

R - Count duplicated rows keeping index of their first occurrences

I have been looking for an efficient way of counting and removing duplicate rows in a data frame while keeping the index of their first occurrences.
For example, if I have a data frame:
df<-data.frame(x=c(9.3,5.1,0.6,0.6,8.5,1.3,1.3,10.8),y=c(2.4,7.1,4.2,4.2,3.2,8.1,8.1,5.9))
ddply(df,names(df),nrow)
gives me
x y V1
1 0.6 4.2 2
2 1.3 8.1 2
3 5.1 7.1 1
4 8.5 3.2 1
5 9.3 2.4 1
6 10.8 5.9 1
But I want to keep the original indices (along with the row names) of the duplicated rows. like:
x y V1
1 9.3 2.4 1
2 5.1 7.1 1
3 0.6 4.2 2
5 8.5 3.2 1
6 1.3 8.1 2
8 10.8 5.9 1
"duplicated" returns the original rownames (here {1 2 3 5 6 8}) but doesnt count the number of occurences. I tried writing functions on my own but none of them are efficient enough to handle big data. My data frame can have up to couple of million rows (though columns are usually 5 to 10).
If you want to keep the index:
library(data.table)
setDT(df)[,.(.I, .N), by = names(df)][!duplicated(df)]
# x y I N
#1: 9.3 2.4 1 1
#2: 5.1 7.1 2 1
#3: 0.6 4.2 3 2
#4: 8.5 3.2 5 1
#5: 1.3 8.1 6 2
#6: 10.8 5.9 8 1
Or using data.tables unique method
unique(setDT(df)[,.(.I, .N), by = names(df)], by = names(df))
We can try with data.table. We convert the 'data.frame' to 'data.table' (setDT(df)); grouping by the 'x' and 'y' columns, we get the number of rows per group (.N).
library(data.table)
setDT(df)[, list(V1=.N), by = .(x,y)]
# x y V1
#1: 9.3 2.4 1
#2: 5.1 7.1 1
#3: 0.6 4.2 2
#4: 8.5 3.2 1
#5: 1.3 8.1 2
#6: 10.8 5.9 1
If we need the row ids,
setDT(df)[, list(V1= .N, rn=.I[1L]), by = .(x,y)]
# x y V1 rn
#1: 9.3 2.4 1 1
#2: 5.1 7.1 1 2
#3: 0.6 4.2 2 3
#4: 8.5 3.2 1 5
#5: 1.3 8.1 2 6
#6: 10.8 5.9 1 8
Or
setDT(df, keep.rownames=TRUE)[, list(V1=.N, rn[1L]), .(x,y)]
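A dplyr sketch of the same count-plus-first-index result (assuming dplyr 1.0+ for the .groups argument) would be:

```r
library(dplyr)

df %>%
  mutate(rn = row_number()) %>%        # remember the original row index
  group_by(x, y) %>%
  summarise(V1 = n(), rn = first(rn), .groups = "drop") %>%
  arrange(rn)                          # restore first-occurrence order
```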

Dividing data and putting into two boxplot

> sleep
extra group ID
1 0.7 1 1
2 -1.6 1 2
3 -0.2 1 3
4 -1.2 1 4
5 -0.1 1 5
6 3.4 1 6
7 3.7 1 7
8 0.8 1 8
9 0.0 1 9
10 2.0 1 10
11 1.9 2 1
12 0.8 2 2
13 1.1 2 3
14 0.1 2 4
15 -0.1 2 5
16 4.4 2 6
17 5.5 2 7
18 1.6 2 8
19 4.6 2 9
20 3.4 2 10
I have this dataset, and I'm supposed to divide it by the effect that group has on different people and put it into two different boxplots. But as you can see, there are group 1 and group 2, and they are both in the same column (group), so I don't know how to divide the data into group 1 and group 2. Can you help me with this?
You don't need to divide the data to put it into a boxplot:
boxplot(extra~group,data=sleep)
You can explore the different options available by using ?boxplot.
Some people like to use the ggplot2 package:
library(ggplot2)
ggplot(sleep,aes(x=group,y=extra,group=group))+geom_boxplot()
Others prefer lattice:
library(lattice)
bwplot(group~extra,data=sleep)
This is a good dataset to use ggplot2 with.
library(ggplot2)
ggplot(sleep, aes(x=factor(group), y=extra)) + geom_boxplot()
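If you really do want two separate plots rather than one plot with two boxes, a base-R sketch would split() the data on group first:

```r
# One boxplot per group, drawn side by side
par(mfrow = c(1, 2))
for (g in split(sleep, sleep$group)) {
  boxplot(g$extra, main = paste("Group", unique(g$group)), ylab = "extra")
}
```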

Selecting Rows which contain daily max value in R

So I want to subset my data frame to select rows with a daily maximum value.
Site Year Day Time Cover Size TempChange
ST1 2011 97 0.0 Closed small 0.97
ST1 2011 97 0.5 Closed small 1.02
ST1 2011 97 1.0 Closed small 1.10
Section of data frame is above. I would like to select only the rows which have the maximum value of the variable TempChange for each variable Day. I want to do this because I am interested in specific variables (not shown) for these particular times.
AMENDED EXAMPLE AND REQUIRED OUTPUT
Site Day Temp Row
a 10 0.2 1
a 10 0.3 2
a 11 0.5 3
a 11 0.4 4
b 10 0.1 5
b 10 0.8 6
b 11 0.7 7
b 11 0.6 8
c 10 0.2 9
c 10 0.3 10
c 11 0.5 11
c 11 0.8 12
REQUIRED OUTPUT
Site Day Temp Row
a 10 0.3 2
a 11 0.5 3
b 10 0.8 6
b 11 0.7 7
c 10 0.3 10
c 11 0.8 12
Hope that makes it clearer.
After faffing with raw data frame code, I realised plyr could do this in one:
> df
Day V Z
1 97 0.26575207 1
2 97 0.09443351 2
3 97 0.88097858 3
4 98 0.62241515 4
5 98 0.61985937 5
6 99 0.06956219 6
7 100 0.86638108 7
8 100 0.08382254 8
> library(plyr)
> ddply(df,~Day,function(x){x[which.max(x$V),]})
Day V Z
1 97 0.88097858 3
2 98 0.62241515 4
3 99 0.06956219 6
4 100 0.86638108 7
To get the rows with the max values for unique combinations of more than one column, just add the variable to the formula. For your modified example, it's then:
> df
Site Day Temp Row
1 a 10 0.2 1
2 a 10 0.3 2
3 a 11 0.5 3
4 a 11 0.4 4
5 b 10 0.1 5
6 b 10 0.8 6
7 b 11 0.7 7
8 b 11 0.6 8
9 c 10 0.2 9
10 c 10 0.3 10
11 c 11 0.5 11
12 c 11 0.8 12
> ddply(df,~Day+Site,function(x){x[which.max(x$Temp),]})
Site Day Temp Row
1 a 10 0.3 2
2 b 10 0.8 6
3 c 10 0.3 10
4 a 11 0.5 3
5 b 11 0.7 7
6 c 11 0.8 12
Note this isn't in the same order as your original dataframe, but you can fix that.
> dmax = ddply(df,~Day+Site,function(x){x[which.max(x$Temp),]})
> dmax[order(dmax$Row),]
Site Day Temp Row
1 a 10 0.3 2
4 a 11 0.5 3
2 b 10 0.8 6
5 b 11 0.7 7
3 c 10 0.3 10
6 c 11 0.8 12
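Since plyr has largely been superseded, a dplyr sketch of the same grouped-max selection (assuming dplyr 1.0+ for slice_max()) is:

```r
library(dplyr)

df %>%
  group_by(Site, Day) %>%
  slice_max(Temp, n = 1, with_ties = FALSE) %>%
  ungroup()
```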