Here is data similar to what I am using:
df <- data.frame(Name=c("Joy","Jane","Jane","Joy"),Grade=c(40,20,63,110))
Name Grade
1 Joy 40
2 Jane 20
3 Jane 63
4 Joy 110
library(plyr)
Agg <- ddply(df, .(Name), summarize, Grade = max(Grade))
Name Grade
1 Jane 63
2 Joy 110
As the grade cannot be greater than 100, I need 40 as the value for Joy, not 110. Basically, I want to exclude all values greater than 100 while summarizing. I could create a new data frame by excluding those values and then applying the ddply function, but I would like to know if I can do it on my original data frame. Thanks in advance.
Using ddply, we can use a logical condition to subset the values of 'Grade':
library(plyr)
ddply(df, .(Name), summarise, Grade = max(Grade[Grade <=100]))
# Name Grade
#1 Jane 63
#2 Joy 40
Or with dplyr, we filter for 'Grade' values less than or equal to 100, then group by 'Name' and get the max of 'Grade':
library(dplyr)
df %>%
filter(Grade <= 100) %>%
group_by(Name) %>%
summarise(Grade = max(Grade))
# Name Grade
# <fctr> <dbl>
#1 Jane 63
#2 Joy 40
Or, instead of filtering first, we can apply the logical condition inside summarise:
df %>%
group_by(Name) %>%
summarise(Grade = max(Grade[Grade <=100]))
Or with data.table, convert the 'data.frame' to a 'data.table' (setDT(df)), apply the logical condition (Grade <= 100) in 'i', group by 'Name', and get the max of 'Grade':
library(data.table)
setDT(df)[Grade <= 100, .(Grade = max(Grade)), by = Name]
# Name Grade
#1: Joy 40
#2: Jane 63
Or using sqldf
library(sqldf)
sqldf("select Name,
max(Grade) as Grade
from df
where Grade <= 100
group by Name")
# Name Grade
#1 Jane 63
#2 Joy 40
In base R, another variant of aggregate would be
aggregate(Grade ~ Name, df, subset = Grade <= 100, max)
# Name Grade
#1 Jane 63
#2 Joy 40
You can also use base R's aggregate for the same result:
aggregate(Grade ~ Name, df[df$Grade <= 100, ], max)
# Name Grade
#1 Jane 63
#2 Joy 40
I have a simple data.frame that looks like this:
Group Person Score_1 Score_2 Score_3
1 1 90 80 79
1 2 74 83 28
1 3 74 94 89
2 1 33 9 8
2 2 94 32 78
2 3 50 90 87
I first need to find the mean of Score_1, collapsing across persons within a group (i.e., the Score_1 mean for Group 1, the Score_1 mean for Group 2, etc.), and then I need to collapse across both groups to find the grand mean of Score_1. How can I calculate these values and store them as individual objects? I have used the "summarise" function in dplyr, with the following code:
summarise(group_by(data,Group),mean(bias,na.rm=TRUE))
I would like to ultimately create a 6th column that gives the mean, repeated across persons for each group, and then a 7th column that gives the grand mean across all groups.
I'm sure there are other ways to do this, and I am open to suggestions (although I would still like to know how to do it in dplyr). Thanks!
data.table is good for tasks like this:
library(data.table)
dt <- read.table(text = "Group Person Score_1 Score_2 Score_3
1 1 90 80 79
1 2 74 83 28
1 3 74 94 89
2 1 33 9 8
2 2 94 32 78
2 3 50 90 87", header = T)
dt <- data.table(dt)
# Mean by group
dt[, score.1.mean.by.group := mean(Score_1), by = .(Group)]
# Grand mean
dt[, score.1.mean := mean(Score_1)]
dt
To create a column, we use mutate rather than summarise. We get the grand mean ('MeanScore1'), then grouped by 'Group', get the mean by group ('MeanScorebyGroup'), and finally order the columns with select:
library(dplyr)
df1 %>%
mutate(MeanScore1 = mean(Score_1)) %>%
group_by(Group) %>%
mutate(MeanScorebyGroup = mean(Score_1)) %>%
select(1:5, 7, 6)
But this can also be done in base R in a simple way:
df1$MeanScorebyGroup <- with(df1, ave(Score_1, Group))
df1$MeanScore1 <- mean(df1$Score_1)
#akrun you just blew my mind!
Just to clarify what you said, here's my interpretation:
library(plyr)
Group <- c(1,1,1,2,2,2)
Person <- c(1,2,3,1,2,3)
Score_1 <- c(90,74,74,33,94,50)
Score_2 <- c(80,83,94,9,32,90)
Score_3 <- c(79,28,89,8,78,87)
df <- data.frame(Group, Person, Score_1, Score_2, Score_3)  # cbind is unnecessary; data.frame takes the vectors directly
df2 <- ddply(df, .(Group), mutate, meanScore = mean(Score_1, na.rm=T))
mutate(df2, meanScoreAll=mean(meanScore))
How can I transform data X to Y as in
X = data.frame(
ID = c(1,1,1,2,2),
NAME = c("MIKE","MIKE","MIKE","LUCY","LUCY"),
SEX = c("MALE","MALE","MALE","FEMALE","FEMALE"),
TEST = c(1,2,3,1,2),
SCORE = c(70,80,90,65,75)
)
Y = data.frame(
ID = c(1,2),
NAME = c("MIKE","LUCY"),
SEX = c("MALE","FEMALE"),
TEST_1 =c(70,65),
TEST_2 =c(80,75),
TEST_3 =c(90,NA)
)
The dcast function in reshape2 seems to work, but it cannot include other columns in the data, like ID, NAME, and SEX in the example above.
Assuming all other columns are consistent within an ID (e.g., Mike can only be a male with ID 1), how can we do it?
According to the documentation (?reshape2::dcast), dcast() allows for ... in the formula:
"..." represents all other variables not used in the formula ...
This is true for both the reshape2 and the data.table packages which both support dcast().
So, you can write:
reshape2::dcast(X, ... ~ TEST, value.var = "SCORE")
# ID NAME SEX 1 2 3
#1 1 MIKE MALE 70 80 90
#2 2 LUCY FEMALE 65 75 NA
However, if the OP insists that the column names should be TEST_1, TEST_2, etc., the TEST column needs to be modified before reshaping. Here, data.table is used:
library(data.table)
dcast(setDT(X)[, TEST := paste0("TEST_", TEST)], ... ~ TEST, value.var = "SCORE")
# ID NAME SEX TEST_1 TEST_2 TEST_3
#1: 1 MIKE MALE 70 80 90
#2: 2 LUCY FEMALE 65 75 NA
which is in line with the expected answer given as data.frame Y.
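As one more sketch, base R's stats::reshape can produce a similar wide format starting from the original X data.frame (note it names the spread columns SCORE.1, SCORE.2, SCORE.3, so a rename step is needed to match Y):

```r
wide <- reshape(X, idvar = c("ID", "NAME", "SEX"),
                timevar = "TEST", direction = "wide")
# columns come out as SCORE.1, SCORE.2, SCORE.3; rename to TEST_1, TEST_2, TEST_3
names(wide) <- sub("^SCORE\\.", "TEST_", names(wide))
```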
I have a data frame:
station person_id date
1 0037 103103 2015-02-02
2 0037 306558 2015-02-02
3 0037 306558 2015-02-04
4 0037 306558 2015-02-05
I need to aggregate the frame by station and date, so that every unique station/date (every row) in the result shows how many people fall on that row.
For example, the first 2 rows would collapse into a single row that shows 2 people for station 0037 and date 2015-02-02.
I tried,
result <- data_frame %>% group_by(station, week = week(date)) %>% summarise_each(funs(length), -date)
You could try:
group_by(df, station, date) %>% summarise(num_people = length(person_id))
Source: local data frame [3 x 3]
Groups: station [?]
station date num_people
(int) (fctr) (int)
1 37 2015-02-02 2
2 37 2015-02-04 1
3 37 2015-02-05 1
In base R, you could use aggregate:
# sample dataset
set.seed(1234)
df <- data.frame(station = sample(1:3, 50, replace = TRUE),
                 person_id = sample(30000:35000, 50, replace = TRUE),
                 date = sample(seq(as.Date("2015-02-05"), as.Date("2015-02-12"),
                                   by = "day"), 50, replace = TRUE))
# calculate number of people per station on a particular date
aggregate(cbind("passengerCount"=person_id) ~ station + date, data=df, FUN=length)
The cbind function is not necessary, but it lets you provide a variable name.
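To illustrate, a minimal sketch of the same call without cbind: the count column then keeps the name of the aggregated variable ("person_id"), which can be renamed afterwards.

```r
# same aggregation without cbind; the count column is named "person_id"
res <- aggregate(person_id ~ station + date, data = df, FUN = length)
# rename the count column to something descriptive
names(res)[names(res) == "person_id"] <- "passengerCount"
```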
With data.table, we convert the 'data.frame' to a 'data.table' and, grouped by 'station' and 'date', get the number of rows (.N):
library(data.table)
setDT(df1)[, .(num_people = .N), .(station, date)]
# station date num_people
#1: 37 2015-02-02 2
#2: 37 2015-02-04 1
#3: 37 2015-02-05 1
I would like to select the youngest person in each group and categorize it by gender
so this is my initial data
data1
ID Age Gender Group
1 A01 25 m a
2 A02 35 f b
3 B03 45 m b
4 C99 50 m b
5 F05 60 f a
6 X05 65 f a
I would like to have this
Gender Group Age ID
m a 25 A01
f a 60 F05
m b 45 B03
f b 35 A02
So I tried the aggregate function, but I don't know how to attach the ID to it:
aggregate(Age ~ Gender + Group, data1, min)
Gender Group Age
m a 25
f a 60
m b 45
f b 35
We can use data.table. We convert the 'data.frame' to a 'data.table' (setDT(data1)). To get the row corresponding to the minimum 'Age', we use which.min to find the row index of the minimum 'Age' grouped by 'Gender' and 'Group', and then use that to subset the rows (.SD[which.min(Age)]).
library(data.table)
setDT(data1)[, .SD[which.min(Age)], by = .(Gender, Group)]
Or another option would be to order by 'Gender', 'Group', and 'Age', and then keep the first row of each group using unique:
unique(setDT(data1)[order(Gender,Group,Age)],
by = c('Gender', 'Group'))
Or, using the same methodology with dplyr, we use slice with which.min to get the row with the minimum 'Age' grouped by 'Gender' and 'Group':
library(dplyr)
data1 %>%
group_by(Gender, Group) %>%
slice(which.min(Age))
Or we can arrange by 'Gender', 'Group', and 'Age', and then get the first row per group:
data1 %>%
arrange(Gender,Group, Age) %>%
group_by(Gender,Group) %>%
slice(1L)
I have a dataframe with some numbers(score) and repeating ID. I want to get the maximum value for each of the ID.
I used this function
top = aggregate(df$score, list(df$ID),max)
This returned me a top dataframe with maximum values corresponding to each ID.
But it so happens that for one of the ID, we have two EQUAL max value. But this function is ignoring the second value.
Is there any way to retain BOTH the max values?
For Example:
df
ID score
1 12
1 15
1 1
1 15
2 23
2 12
2 13
The above function gives me this:
top
ID Score
1 15
2 23
I need this:
top
ID Score
1 15
1 15
2 23
I recommend data.table as Chris mentioned (good for speed, but steeper learning curve).
Or if you don't want data.table you could use plyr:
library(plyr)
ddply(df, .(ID), subset, score==max(score))
# same as ddply(df, .(ID), function (x) subset(x, score==max(score)))
You can convert to a data.table:
library(data.table)
DT <- as.data.table(df)
DT[, .SD[score == max(score)], by=ID]
Here is a dplyr solution.
library(dplyr)
df %>%
group_by(ID) %>%
filter(score == max(score))
Otherwise, to build on what you have done, we can use a sneaky property of merge on your "top" dataframe, see the following example:
df1 <- data.frame(ID = c(1,1,5,2), score = c(5,5,2,6))
top_df <- data.frame(ID = c(1,2), score = c(5,6))
merge(df1, top_df)
which gives:
ID score
1 1 5
2 1 5
3 2 6
Staying with a data.frame:
df[unlist(by(df, df$ID, FUN=function(D) rownames(D)[D$score == max(D$score)] )),]
# ID score
#2 1 15
#4 1 15
#5 2 23
This works because by splits df into a list of data.frames on the basis of df$ID but retains the original rownames of df (see by(df, df$ID, I)). Therefore, the rownames of each subset D corresponding to a max score value in each group can still be used to subset the original df.
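A small sketch to see this in action: the pieces returned by by() keep the rownames of the original data.frame, so collecting them gives valid indices back into df.

```r
# each group returned by by() retains the original rownames of df
pieces <- by(df, df$ID, I)
rownames(pieces[[1]])   # e.g. the rownames df had for the ID == 1 rows

# so the rownames of the per-group max-score rows index df directly
idx <- unlist(by(df, df$ID, function(D) rownames(D)[D$score == max(D$score)]))
df[idx, ]
```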
A simple base R solution:
df <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2),
score = c(12, 15, 1, 15, 23, 12, 13))
Several options:
df[df$score %in% tapply(df$score, df$ID, max), ]
df[df$score %in% aggregate(score ~ ID, data = df, max)$score, ]
df[df$score %in% aggregate(df$score, list(df$ID), max)$x, ]
Output:
ID score
2 1 15
4 1 15
5 2 23
Using sqldf:
library(sqldf)
sqldf('SELECT df.ID, score FROM df
JOIN (SELECT ID, MAX(score) AS score FROM df GROUP BY ID)
USING (score)')
Output:
ID score
2 1 15
4 1 15
5 2 23