selecting other row element after aggregate in R [duplicate] - r

I would like to select the youngest person within each combination of group and gender.
This is my initial data:
data1
ID Age Gender Group
1 A01 25 m a
2 A02 35 f b
3 B03 45 m b
4 C99 50 m b
5 F05 60 f a
6 X05 65 f a
I would like to have this
Gender Group Age ID
m a 25 A01
f a 60 F05
m b 45 B03
f b 35 A02
So I tried the aggregate function, but I don't know how to attach the ID to its result:
aggregate(Age~Gender+Group,data1,min)
Gender Group Age
m a 25
f a 60
m b 45
f b 35

We can use data.table. First convert the 'data.frame' to a 'data.table' with setDT(data1). To get the row with the minimum 'Age', we use which.min to find the index of that row within each 'Gender', 'Group' combination and then use it to subset the rows (.SD[which.min(Age)]).
setDT(data1)[, .SD[which.min(Age)], by = .(Gender, Group)]
Another option is to order by 'Gender', 'Group', 'Age' and then keep the first row of each 'Gender', 'Group' combination using unique.
unique(setDT(data1)[order(Gender,Group,Age)],
by = c('Gender', 'Group'))
Or, using the same approach with dplyr, we group by 'Gender', 'Group' and use slice with which.min to keep the row with the minimum 'Age' in each group.
library(dplyr)
data1 %>%
group_by(Gender, Group) %>%
slice(which.min(Age))
Or we can arrange by 'Gender', 'Group', 'Age' and then get the first row
data1 %>%
arrange(Gender,Group, Age) %>%
group_by(Gender,Group) %>%
slice(1L)
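For completeness, the aggregate attempt from the question can also be carried through in base R. The following is a minimal sketch, assuming the data frame looks exactly like the one shown in the question; it merges the per-group minimum ages back onto the original data so that the matching ID comes along. The name `youngest` is only illustrative.

```r
# Rebuild the example data from the question.
data1 <- data.frame(
  ID     = c("A01", "A02", "B03", "C99", "F05", "X05"),
  Age    = c(25, 35, 45, 50, 60, 65),
  Gender = c("m", "f", "m", "m", "f", "f"),
  Group  = c("a", "b", "b", "b", "a", "a"),
  stringsAsFactors = FALSE
)

# Minimum age per Gender/Group combination (same call as in the question).
youngest <- aggregate(Age ~ Gender + Group, data = data1, FUN = min)

# merge() joins on all shared columns (Gender, Group, Age), which pulls
# the matching ID back in.
result <- merge(youngest, data1)
result
```

Note that if two people in the same group share the minimum age, this approach returns both rows, whereas which.min returns only the first.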

Related

How to create a new variable conditioning on another variable in R? [duplicate]

I want to calculate mean (or any other summary statistics of length one, e.g. min, max, length, sum) of a numeric variable ("value") within each level of a grouping variable ("group").
The summary statistic should be assigned to a new variable which has the same length as the original data. That is, each row of the original data should have a value corresponding to the current group value - the data set should not be collapsed to one row per group. For example, consider group mean:
Before
id group value
1 a 10
2 a 20
3 b 100
4 b 200
After
id group value grp.mean.values
1 a 10 15
2 a 20 15
3 b 100 150
4 b 200 150
You may do this in dplyr using mutate:
library(dplyr)
df %>%
group_by(group) %>%
mutate(grp.mean.values = mean(value))
...or use data.table to assign the new column by reference (:=):
library(data.table)
setDT(df)[ , grp.mean.values := mean(value), by = group]
Have a look at the ave function. Something like
df$grp.mean.values <- ave(df$value, df$group)
If you want to use ave to calculate something else per group, you need to specify FUN = your-desired-function, e.g. FUN = min:
df$grp.min <- ave(df$value, df$group, FUN = min)
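As a quick check, the ave calls above can be run on the four-row example from the question. This is only a sketch that rebuilds the "Before" table by hand; the column names follow the question.

```r
# The "Before" data from the question.
df <- data.frame(
  id    = 1:4,
  group = c("a", "a", "b", "b"),
  value = c(10, 20, 100, 200)
)

# ave() returns a vector as long as its input; its default FUN is mean.
df$grp.mean.values <- ave(df$value, df$group)

# Any other length-one summary is passed by name through FUN.
df$grp.min <- ave(df$value, df$group, FUN = min)
df
```

Because ave keeps the original row order and length, the result can be assigned directly as a new column without any merging.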
One option is to use plyr. ddply expects a data.frame (the first d) and returns a data.frame (the second d). Other XXply functions work in a similar way; i.e. ldply expects a list and returns a data.frame, dlply does the opposite...and so on and so forth. The second argument is the grouping variable(s). The third argument is the function we want to compute for each group.
require(plyr)
ddply(dat, "group", transform, grp.mean.values = mean(value))
id group value grp.mean.values
1 1 a 10 15
2 2 a 20 15
3 3 b 100 150
4 4 b 200 150
Here is another option using base functions aggregate and merge:
merge(x, aggregate(value ~ group, data = x, mean),
by = "group")
group id value.x value.y
1 a 1 10 15
2 a 2 20 15
3 b 3 100 150
4 b 4 200 150
You can get "better" column names with suffixes:
merge(x, aggregate(value ~ group, data = x, mean),
by = "group", suffixes = c("", ".mean"))
group id value value.mean
1 a 1 10 15
2 a 2 20 15
3 b 3 100 150
4 b 4 200 150

Using dplyr to summarise values and store as vector in data frame?

I have a simple data.frame that looks like this:
Group Person Score_1 Score_2 Score_3
1 1 90 80 79
1 2 74 83 28
1 3 74 94 89
2 1 33 9 8
2 2 94 32 78
2 3 50 90 87
I first need to find the mean of Score_1, collapsing across persons within a group (i.e., the Score_1 mean for Group 1, the Score_1 mean for Group 2, etc.), and then I need to collapse across both groups to find the overall mean of Score_1. How can I calculate these values and store them as individual objects? I have used the "summarise" function in dplyr, with the following code:
summarise(group_by(data,Group),mean(bias,na.rm=TRUE))
I would like to ultimately create a 6th column that gives the mean, repeated across persons for each group, and then a 7th column that gives the grand mean across all groups.
I'm sure there are other ways to do this, and I am open to suggestions (although I would still like to know how to do it in dplyr). Thanks!
data.table is good for tasks like this:
library(data.table)
dt <- read.table(text = "Group Person Score_1 Score_2 Score_3
1 1 90 80 79
1 2 74 83 28
1 3 74 94 89
2 1 33 9 8
2 2 94 32 78
2 3 50 90 87", header = TRUE)
dt <- data.table(dt)
# Mean by group
dt[, score.1.mean.by.group := mean(Score_1), by = .(Group)]
# Grand mean
dt[, score.1.mean := mean(Score_1)]
dt
To create a column, we use mutate rather than summarise. We first compute the grand mean ('MeanScore1'), then group by 'Group' and compute the mean by group ('MeanScorebyGroup'), and finally reorder the columns with select.
library(dplyr)
df1 %>%
mutate(MeanScore1 = mean(Score_1)) %>%
group_by(Group) %>%
mutate(MeanScorebyGroup = mean(Score_1)) %>%
select(1:5, 7, 6)
This can also be done in a simple way using base R:
df1$MeanScorebyGroup <- with(df1, ave(Score_1, Group))
df1$MeanScore1 <- mean(df1$Score_1)
@akrun you just blew my mind!
Just to clarify what you said, here's my interpretation:
library(plyr)
Group <- c(1,1,1,2,2,2)
Person <- c(1,2,3,1,2,3)
Score_1 <- c(90,74,74,33,94,50)
Score_2 <- c(80,83,94,9,32,90)
Score_3 <- c(79,28,89,8,78,87)
df <- data.frame(Group, Person, Score_1, Score_2, Score_3)
df2 <- ddply(df, .(Group), mutate, meanScore = mean(Score_1, na.rm = TRUE))
mutate(df2, meanScoreAll = mean(meanScore))
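One caveat with averaging the per-row group means: it matches the grand mean of Score_1 when no values are missing, but the two can differ once na.rm drops values. The sketch below uses made-up data (not from the question) and base R's ave in place of ddply to illustrate the difference.

```r
# Hypothetical data with one missing score in group 1.
scores <- data.frame(
  Group   = c(1, 1, 1, 2),
  Score_1 = c(90, NA, 74, 30)
)

# Group mean ignoring NA, repeated on every row of the group.
scores$meanScore <- ave(scores$Score_1, scores$Group,
                        FUN = function(v) mean(v, na.rm = TRUE))

# Averaging the repeated group means counts the NA row as well ...
mean_of_row_means <- mean(scores$meanScore)

# ... whereas the grand mean only uses the observed scores.
grand_mean <- mean(scores$Score_1, na.rm = TRUE)

c(mean_of_row_means = mean_of_row_means, grand_mean = grand_mean)
```

If the grand mean of the observed scores is what is wanted, computing it directly from Score_1 is the safer choice.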

Excluding values when using ddply

Here is some data similar to what I am using:
df <- data.frame(Name=c("Joy","Jane","Jane","Joy"),Grade=c(40,20,63,110))
Name Grade
1 Joy 40
2 Jane 20
3 Jane 63
4 Joy 110
Agg <- ddply(df, .(Name), summarize,Grade= max(Grade))
Name Grade
1 Jane 63
2 Joy 110
As the grade cannot be greater than 100, I need 40 as the value for Joy and not 110. Basically, I want to exclude all values greater than 100 while summarizing. I can create a new data frame by excluding the values and then applying the ddply function, but I would like to know if I can do it on my original data frame. Thanks in advance.
Using ddply, we can use the logical condition to subset the values of 'Grade'
library(plyr)
ddply(df, .(Name), summarise, Grade = max(Grade[Grade <=100]))
# Name Grade
#1 Jane 63
#2 Joy 40
Or with dplyr, we filter the rows where "Grade" is less than or equal to 100, then group by "Name" and get the max of "Grade".
library(dplyr)
df %>%
filter(Grade <= 100) %>%
group_by(Name) %>%
summarise(Grade = max(Grade))
# Name Grade
# <fctr> <dbl>
#1 Jane 63
#2 Joy 40
Or instead of the filter, we can create the logical condition in summarise
df %>%
group_by(Name) %>%
summarise(Grade = max(Grade[Grade <=100]))
Or with data.table, convert the 'data.frame' to 'data.table' (setDT(df)), create the logical condition (Grade <= 100) in 'i', grouped by "Name", get the max of "Grade".
library(data.table)
setDT(df)[Grade <= 100, .(Grade = max(Grade)), by = Name]
# Name Grade
#1: Joy 40
#2: Jane 63
Or using sqldf
library(sqldf)
sqldf("select Name,
max(Grade) as Grade
from df
where Grade <= 100
group by Name")
# Name Grade
#1 Jane 63
#2 Joy 40
In base R, another variant of aggregate would be
aggregate(Grade ~ Name, df, subset = Grade <= 100, max)
# Name Grade
#1 Jane 63
#2 Joy 40
Alternatively, subset the data frame before passing it to base R's aggregate:
aggregate(Grade ~ Name, df[df$Grade <= 100, ], max)
# Name Grade
#1 Jane 63
#2 Joy 40
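The two styles shown above (filtering rows first versus subsetting inside the summary function) behave differently when a group has no valid values at all. The following sketch uses a hypothetical variant of the data in which both of Joy's grades exceed 100, and base R's tapply to stand in for the "subset inside the summary" approaches.

```r
# Hypothetical data: every grade for "Joy" is above 100.
df <- data.frame(Name  = c("Joy", "Jane", "Jane", "Joy"),
                 Grade = c(140, 20, 63, 110))

# Filtering first: "Joy" has no rows left, so she drops out of the result.
filtered <- aggregate(Grade ~ Name, df[df$Grade <= 100, ], max)

# Subsetting inside the summary: max() of an empty vector is -Inf
# (R also emits a warning, suppressed here).
inside <- suppressWarnings(
  tapply(df$Grade, df$Name, function(g) max(g[g <= 100]))
)

filtered
inside
```

Which behaviour is preferable depends on whether groups without valid values should disappear from the output or remain visible with a sentinel value.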
