I have a dataframe with some numbers (score) and repeating IDs. I want to get the maximum value for each ID.
I used this function:
top = aggregate(df$score, list(df$ID),max)
This returned me a top dataframe with maximum values corresponding to each ID.
But it so happens that for one of the IDs, there are two EQUAL max values, and this function drops the second one.
Is there any way to retain BOTH max values?
For Example:
df
ID score
1 12
1 15
1 1
1 15
2 23
2 12
2 13
The above function gives me this:
top
ID Score
1 15
2 23
I need this:
top
ID Score
1 15
1 15
2 23
I recommend data.table as Chris mentioned (good for speed, but steeper learning curve).
Or if you don't want data.table you could use plyr:
library(plyr)
ddply(df, .(ID), subset, score==max(score))
# same as ddply(df, .(ID), function (x) subset(x, score==max(score)))
You can convert to a data.table:
DT <- as.data.table(df)
DT[, .SD[score == max(score)], by=ID]
Here is a dplyr solution.
library(dplyr)
df %>%
group_by(ID) %>%
filter(score == max(score))
Otherwise, to build on what you have done, we can use a sneaky property of merge on your "top" dataframe, see the following example:
df1 <- data.frame(ID = c(1,1,5,2), score = c(5,5,2,6))
top_df <- data.frame(ID = c(1,2), score = c(5,6))
merge(df1, top_df)
which gives:
ID score
1 1 5
2 1 5
3 2 6
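Applied to the question's data, the same idea can be sketched as follows: compute the per-ID maxima with aggregate, then merge them back onto df. This sketch assumes top is built with the formula interface so that its columns are named ID and score (the question's call names them Group.1 and x), which lets merge join on both columns.

```r
df <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2),
                 score = c(12, 15, 1, 15, 23, 12, 13))

# Per-ID maxima; the formula interface keeps the column names ID and score
top <- aggregate(score ~ ID, data = df, FUN = max)

# merge() joins on all shared columns (ID and score), so every row of df
# that equals its group's maximum is kept, including ties
res <- merge(df, top)
res
```

Both tied rows for ID 1 survive because merge returns one output row per matching row of df.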
Staying with a data.frame:
df[unlist(by(df, df$ID, FUN=function(D) rownames(D)[D$score == max(D$score)] )),]
# ID score
#2 1 15
#4 1 15
#5 2 23
This works because by splits df into a list of data.frames on the basis of df$ID, but retains the original rownames of df ( see by(df, df$ID, I) ). Therefore, returning the rownames of each D subset corresponding to a max score value in each group can still be used to subset the original df.
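The row-name behaviour can be checked with a small sketch. It uses split instead of by (an assumption made for easier inspection; both keep the original row names of each piece), and the names pieces and keep are illustrative only.

```r
df <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2),
                 score = c(12, 15, 1, 15, 23, 12, 13))

# Each piece keeps the row names it had in df
pieces <- split(df, df$ID)
rownames(pieces[["2"]])

# Collect the row names of the per-group maxima, then index df with them
keep <- unlist(lapply(pieces, function(D) rownames(D)[D$score == max(D$score)]))
res <- df[keep, ]
res
```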
A simple base R solution:
df <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2),
score = c(12, 15, 1, 15, 23, 12, 13))
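Several options (each compares score with the maxima of all groups at once, so they assume that one group's maximum never appears as a non-maximal score in another group):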
df[df$score %in% tapply(df$score, df$ID, max), ]
df[df$score %in% aggregate(score ~ ID, data = df, max)$score, ]
df[df$score %in% aggregate(df$score, list(df$ID), max)$x, ]
Output:
ID score
2 1 15
4 1 15
5 2 23
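The %in% test compares each score with the maxima of every group, not only its own. A minimal sketch of a per-group comparison, using base R's ave() (not part of the answer above) on a made-up df2 in which ID 2 also contains a 15:

```r
# Made-up data: 15 is the maximum for ID 1 but not for ID 2
df2 <- data.frame(ID = c(1, 1, 2, 2),
                  score = c(15, 3, 15, 23))

# %in% also keeps the 15 in ID 2, because 15 is the maximum of ID 1
loose <- df2[df2$score %in% tapply(df2$score, df2$ID, max), ]

# ave() returns the group maximum aligned to each row, so the comparison
# is made within each ID
strict <- df2[df2$score == ave(df2$score, df2$ID, FUN = max), ]

nrow(loose)   # 3
nrow(strict)  # 2
```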
Using sqldf:
library(sqldf)
sqldf('SELECT df.ID, score FROM df
JOIN (SELECT ID, MAX(score) AS score FROM df GROUP BY ID)
USING (ID, score)')
Output:
ID score
1 1 15
2 1 15
3 2 23
Related
In R, I'm trying to average a subset of a column based on selecting a certain value (ID) in another column. Consider the example of choosing an ID among 100 IDs, perhaps the ID number being 5. Then, I want to average a subset of values in another column that corresponds to the ID number that is 5. Then, I want to do the same thing for the rest of the IDs. What should this function be?
Using dplyr:
library(dplyr)
dt <- data.frame(ID = rep(1:3, each=3), values = runif(9, 1, 100))
dt %>%
group_by(ID) %>%
summarise(avg = mean(values))
Output:
ID avg
<int> <dbl>
1 1 41.9
2 2 79.8
3 3 39.3
Data:
ID values
1 1 8.628964
2 1 99.767843
3 1 17.438596
4 2 79.700918
5 2 87.647472
6 2 72.135906
7 3 53.845573
8 3 50.205122
9 3 13.811414
We can use a group-by mean. In base R, this can be done with aggregate:
dt <- data.frame(ID = rep(1:3, each=3), values = runif(9, 1, 100))
aggregate(values ~ ID, dt, mean)
Output:
ID values
1 1 40.07086
2 2 53.59345
3 3 47.80675
Similar to this question here, I am trying to find the difference between the maximum value of a group and the value of the current row.
For instance, if I have the following dataset:
ID <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
group <- data.frame(Subject=ID, pt=Value)
How would I go about creating a new column called "diff" that would be the difference between the value of the current row and the maximum value in that group?
Thank you for your help!
The OP has tried a data.table solution. Here, we benefit from grouping and updating by reference simultaneously.
library(data.table)
setDT(group)[, diff := max(pt) - pt, by = Subject][]
Subject pt diff
1: 1 2 3
2: 1 3 2
3: 1 5 0
4: 2 2 15
5: 2 5 12
6: 2 8 9
7: 2 17 0
8: 3 3 2
9: 3 5 0
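For comparison, the same grouped difference can be sketched in base R with ave() (not used in this answer), which returns the group maximum aligned to each row:

```r
ID <- c(1, 1, 1, 2, 2, 2, 2, 3, 3)
Value <- c(2, 3, 5, 2, 5, 8, 17, 3, 5)
group <- data.frame(Subject = ID, pt = Value)

# Group maximum repeated for every row of its group, minus the row's own value
group$diff <- ave(group$pt, group$Subject, FUN = max) - group$pt
group
```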
Data
ID <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
group <- data.frame(Subject=ID, pt=Value)
Benchmark
At the time of writing, 5 answers were posted, including Frank's comment on the efficiency of the data.table approach. So, I was wondering which of the five solutions was the fastest.
r2evans
mine
Frank
harelhan
JonMinton
Some solutions modify the data.frame in place, so a fair comparison needs some care.
In addition, the OP asked for a new column called "diff". For comparison, all results should return group with three columns. Some answers were modified accordingly. The answer of harelhan required substantial modifications to remove the errors.
As group is modified, all benchmark runs start with a fresh copy of group with two columns.
The benchmark is parameterized over the number of rows and the share of groups, i.e., the number of groups varies with the problem size in order to scale.
library(data.table)
library(dplyr)
library(bench)
bm <- press(
# n_row = c(1E2, 1E4, 1E5, 1E6),
n_row = c(1E2, 1E4, 1E5),
grp_share = c(0.01, 0.1, 0.5, 0.9),
{
n_grp <- grp_share * n_row
set.seed(1)
group0 <- data.frame(
Subject = sample(n_grp, n_row, TRUE),
pt = as.numeric(rpois(n_row, 100)))
mark(
r2Evans = {
group <- copy(group0)
group <- group %>%
group_by(Subject) %>%
mutate(diff = max(pt) - pt)
group
},
Uwe = {
group <- copy(group0)
setDT(group)[, diff := max(pt) - pt, by = Subject]
group
},
Frank = {
group <- copy(group0)
setDT(group)[, mx := max(pt), by=Subject][, diff := mx - pt][, mx := NULL]
group
},
harelhan = {
group <- copy(group0)
max_group <- group %>% group_by(Subject) %>% summarize(max_val = max(pt))
group <- left_join(group, max_group[, c("Subject", "max_val")], by = "Subject")
group$diff <- group$max_val - group$pt
group <- group %>% select(-max_val)
group
},
JonMinton = {
group <- copy(group0)
group <- group %>%
group_by(Subject) %>%
mutate(max_group_val = max(pt)) %>%
ungroup() %>%
mutate(diff = max_group_val - pt) %>%
select(-max_group_val)
group
},
check = FALSE # the results differ in class (tibble vs. data.table), so skip the equality check
)
}
)
ggplot2::autoplot(bm)
Using your example data and breaking the logic into smaller steps:
library(dplyr)
ID <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
group <- data.frame(Subject=ID, pt=Value)
max_group <- group %>% group_by(Subject) %>% summarize(max_val = max(pt))
group <- left_join(group, max_group[, c("Subject", "max_val")], by = "Subject")
group$diff <- group$max_val - group$pt
Hope this solves the problem.
Based on harelhan's answer, but with piping:
library(dplyr)
df <- tibble(
id = c(1,1,1,2,2,2,2,3,3),
value = c(2,3,5,2,5,8,17,3,5)
)
df %>%
group_by(id) %>%
mutate(max_group_val = max(value)) %>%
ungroup() %>%
mutate(diff_frm_group_max = max_group_val - value)
# A tibble: 9 x 4
id value max_group_val diff_frm_group_max
<dbl> <dbl> <dbl> <dbl>
1 1 2 5 3
2 1 3 5 2
3 1 5 5 0
4 2 2 17 15
5 2 5 17 12
6 2 8 17 9
7 2 17 17 0
8 3 3 5 2
9 3 5 5 0
I am trying to summarize data with NA values and am using the ddply function.
For example, using the data included below,
set.seed(123)
dat <- data.frame(IndID = rep(c("AAA", "BBB", "CCC"), 100),
ValOne = sample(c(1, 0, NA), replace = T, 300),
ValTwo = sample(c(1,NA), replace = T, 300),
VarThree = sample(c("Thanks", "alot"), replace = T, 300))
> head(dat)
IndID ValOne ValTwo
1 AAA 1 NA
2 BBB NA 1
3 CCC 0 NA
4 AAA NA NA
5 BBB NA NA
6 CCC 1 1
I want to calculate the number of times that each individual has a 1 in the ValOne and ValTwo columns. I am using the code below to create a new data.frame that summarizes the data by IndID, using both the length and sum functions.
library(plyr)
tbl <- ddply(dat, "IndID", summarise,
ColOne = length(dat$ValOne[dat$ColOne == 1]),
NumHighHDOP = sum(dat$ValTwo[dat$ValTwo == 1], na.rm = T))
As seen below,
IndID ColOne NumHighHDOP
1 AAA 0 155
2 BBB 0 155
3 CCC 0 155
the resulting table summarizes the data for the entire data.frame and not for each individual.
Both approaches (length and sum) are struggling with the NAs in the data.frame. Any suggestions would be appreciated.
EDIT With the new data set including a factor. Is it also possible to calculate the number of "Thanks" for each individual?
We can use dplyr. We group by 'IndID' and get the count of 1s for each column with summarise and across. To handle the NA elements, either drop them with na.omit or use a logical condition that is TRUE only for 1.
library(dplyr)
dat %>%
group_by(IndID) %>%
summarise(across(everything(), ~ sum(.x == 1 & !is.na(.x))))
#or
#summarise(across(everything(), ~ sum(na.omit(.x) == 1)))
Update
Based on the updated dataset in the OP's post, if we want to count the 'Thanks' in the third column, we can use %in% (assuming that 'Thanks' is not present in the other two columns and 1 not in the last column).
dat %>%
group_by(IndID) %>%
summarise(across(everything(), ~ sum(na.omit(.x) %in% c(1, 'Thanks'))))
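The ddply call in the question returned whole-data totals because its expressions index dat$ValOne and dat$ValTwo, which always refer to the full data.frame rather than the current group's columns. A base R sketch of the per-individual count, on a small made-up dat (not the question's random data):

```r
# Made-up data with NAs in both value columns
dat <- data.frame(IndID = c("AAA", "BBB", "AAA", "BBB"),
                  ValOne = c(1, NA, 1, 0),
                  ValTwo = c(NA, 1, 1, 1))

# aggregate() passes each group's column to FUN, so the count is per IndID;
# na.rm = TRUE makes NA comparisons count as "not a 1"
counts <- aggregate(dat[c("ValOne", "ValTwo")],
                    by = list(IndID = dat$IndID),
                    FUN = function(x) sum(x == 1, na.rm = TRUE))
counts
```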
I am working with a data frame corresponding to the example below:
set.seed(1)
dta <- data.frame("CatA" = rep(c("A","B","C"), 4), "CatNum" = rep(1:2,6),
"SomeVal" = runif(12))
I would like to quickly build a data frame that would have sum values for all the combinations of the categories derived from the CatA and CatNum as well as for the categories derived from each column separately. On the primitive example above, for the first couple of combinations, this can be achieved with use of simple code:
df_sums <- data.frame(
"Category" = c("Total for A",
"Total for A and 1",
"Total for A and 2"),
"Sum" = c(sum(dta$SomeVal[dta$CatA == 'A']),
sum(dta$SomeVal[dta$CatA == 'A' & dta$CatNum == 1]),
sum(dta$SomeVal[dta$CatA == 'A' & dta$CatNum == 2]))
)
This produces an informative data frame of sums:
Category Sum
1 Total for A 2.1801780
2 Total for A and 1 1.2101839
3 Total for A and 2 0.9699941
This solution would be grossly inefficient when applied to a data frame with multiple categories. I would like to achieve the following:
Cycle through all the categories, including categories derived from each column separately as well as from both columns at the same time
Achieve some flexibility with respect to how the function is applied, for instance I may want to apply mean instead of sum
Save the "Total for" string as a separate object that I could easily edit when applying a function other than sum.
I was initially thinking of using dplyr, on the lines:
require(dplyr)
df_sums_experiment <- dta %>%
group_by(CatA, CatNum) %>%
summarise(TotVal = sum(SomeVal))
But it's not clear to me how I could apply multiple groupings simultaneously. As stated, I'm interested in grouping by each column separately and by the combination of both columns. I would also like to create a string column that would indicate what is combined and in what order.
You could use tidyr to unite the columns and gather the data. Then use dplyr to summarise:
library(dplyr)
library(tidyr)
dta %>% unite(measurevar, CatA, CatNum, remove=FALSE) %>%
gather(key, val, -SomeVal) %>%
group_by(val) %>%
summarise(sum(SomeVal))
val sum(SomeVal)
(chr) (dbl)
1 1 2.8198078
2 2 3.0778622
3 A 2.1801780
4 A_1 1.2101839
5 A_2 0.9699941
6 B 1.4405782
7 B_1 0.4076565
8 B_2 1.0329217
9 C 2.2769138
10 C_1 1.2019674
11 C_2 1.0749464
Just loop over the column combinations, compute the quantities you want and then rbind them together:
library(data.table)
dt = as.data.table(dta) # or setDT to convert in place
cols = c('CatA', 'CatNum')
rbindlist(apply(combn(c(cols, ""), length(cols)), 2,
function(i) dt[, sum(SomeVal), by = c(i[i != ""])]), fill = TRUE)
# CatA CatNum V1
# 1: A 1 1.2101839
# 2: B 2 1.0329217
# 3: C 1 1.2019674
# 4: A 2 0.9699941
# 5: B 1 0.4076565
# 6: C 2 1.0749464
# 7: A NA 2.1801780
# 8: B NA 1.4405782
# 9: C NA 2.2769138
#10: NA 1 2.8198078
#11: NA 2 3.0778622
Split then use apply
#result
res <- do.call(rbind,
lapply(
c(split(dta,dta$CatA),
split(dta,dta$CatNum),
split(dta,dta[,1:2])),
function(i)sum(i[,"SomeVal"])))
#prettify the result
res1 <- data.frame(Category=paste0("Total for ",rownames(res)),
Sum=res[,1])
res1$Category <- sub("."," and ",res1$Category,fixed=TRUE)
row.names(res1) <- seq_along(row.names(res1))
res1
# Category Sum
# 1 Total for A 2.1801780
# 2 Total for B 1.4405782
# 3 Total for C 2.2769138
# 4 Total for 1 2.8198078
# 5 Total for 2 3.0778622
# 6 Total for A and 1 1.2101839
# 7 Total for B and 1 0.4076565
# 8 Total for C and 1 1.2019674
# 9 Total for A and 2 0.9699941
# 10 Total for B and 2 1.0329217
# 11 Total for C and 2 1.0749464
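The question also asks for a swappable summary function and an editable "Total for" label. A minimal base R sketch of that idea follows; summarise_groupings() and its arguments are hypothetical names made up for this illustration, not part of any package or of the answers above.

```r
set.seed(1)
dta <- data.frame(CatA = rep(c("A", "B", "C"), 4),
                  CatNum = rep(1:2, 6),
                  SomeVal = runif(12))

# Hypothetical helper: one aggregate() call per grouping, results stacked
summarise_groupings <- function(data, groupings, value,
                                fun = sum, prefix = "Total for") {
  pieces <- lapply(groupings, function(g) {
    agg <- aggregate(data[value], by = data[g], FUN = fun)
    # Build labels such as "Total for A" or "Total for A and 1"
    label <- do.call(paste, c(agg[g], sep = " and "))
    data.frame(Category = paste(prefix, label), Value = agg[[value]])
  })
  do.call(rbind, pieces)
}

res <- summarise_groupings(dta,
                           groupings = list("CatA", "CatNum", c("CatA", "CatNum")),
                           value = "SomeVal")
res
```

Passing fun = mean and prefix = "Mean for" would switch both the statistic and the label without touching the grouping logic.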
I need to reshape a data.frame in R in one step.
In short, the change in value of each object (x1 to x6) is recorded row by row (from 1990 to 1995):
> tab1[1:10, ] # raw data see plot for tab1
id value year
1 x1 7 1990
2 x1 10 1991
3 x1 11 1992
4 x1 7 1993
5 x1 3 1994
6 x1 1 1995
7 x2 6 1990
8 x2 7 1991
9 x2 9 1992
10 x2 5 1993
I am able to do the reshaping step by step. Does anybody know how to do it in one step?
Original data
Table 1 - note that the minimum value across all time series is 0.
Step1:
Table 2 - rescale each time series so that its minimum value equals 0.
Every series drops down to the x-axis.
Step2:
Table 3 - apply the diff() function to each time series.
Step3:
Table 4 - apply the sort() function to each time series.
I hope the pictures are clear enough for understanding each step.
So final table looks like this:
> tab4[1:10, ]
id value time
1 x1 -4 1
2 x1 -4 2
3 x1 -2 3
4 x1 1 4
5 x1 3 5
6 x2 -4 1
7 x2 -3 2
8 x2 1 3
9 x2 1 4
10 x2 2 5
# Source data:
tab1 <- data.frame(id = rep(c("x1","x2","x3","x4","x5","x6"), each = 6),
value = c(7,10,11,7,3,1,6,7,9,5,2,3,11,9,7,9,1,
0,1,2,2,4,7,4,2,3,1,6,4,2,3,5,4,3,5,6),
year = rep(c(1990:1995), times = 6))
tab2 <- data.frame(id = rep(c("x1","x2","x3","x4","x5","x6"), each = 6),
value = c(6,9,10,6,2,0,4,5,7,3,0,1,11,9,7,9,1,0,
0,1,1,3,6,3,1,2,0,5,3,1,0,2,1,0,2,3),
year = rep(c(1990:1995), times = 6))
tab3 <- data.frame(id = rep(c("x1","x2","x3","x4","x5","x6"), each = 5),
value = c(3,1,-4,-4,-2,1,2,-4,-3,1,-2,-2,2,-8,-1,
1,0,2,3,-3,1,-2,5,-2,-2,2,-1,-1,2,1),
time = rep(c(1:5), times = 6))
tab4 <- data.frame(id = rep(c("x1","x2","x3","x4","x5","x6"), each = 5),
value = c(-4,-4,-2,1,3,-4,-3,1,1,2,-8,-2,-2,-1,2,
-3,0,1,2,3,-2,-2,-2,1,5,-1,-1,1,2,2),
time = rep(c(1:5), times = 6))
Using data.table, this is simply:
require(data.table) ## 1.9.2
ans <- setDT(tab1)[, list(value=diff(value)), by=id] ## aggregation
setkey(ans, id,value)[, time := seq_len(.N), by=id] ## order + add 'time' column
Note that your 'step 1' is unnecessary as your second step is calculating difference and it wouldn't have any effect (and is therefore skipped here).
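The reason step 1 can be skipped is that diff() is unchanged by subtracting a constant from the series. A small base R sketch on the x1 and x2 series from the question (the names shift_invariant, sorted and res are illustrative only):

```r
tab1 <- data.frame(id = rep(c("x1", "x2"), each = 6),
                   value = c(7, 10, 11, 7, 3, 1, 6, 7, 9, 5, 2, 3),
                   year = rep(1990:1995, times = 2))

# Subtracting the minimum first does not change the differences
x <- tab1$value[tab1$id == "x1"]
shift_invariant <- identical(diff(x), diff(x - min(x)))

# Difference and sort each series, then rebuild a long table with a time index
sorted <- lapply(split(tab1$value, tab1$id), function(v) sort(diff(v)))
res <- do.call(rbind, lapply(names(sorted), function(k)
  data.frame(id = k, value = sorted[[k]], time = seq_along(sorted[[k]]))))
res
```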
It sounds like you want to apply a set of functions to each group of a grouping variable. There are many ways to do this in R (from base R by and tapply to add-on packages like plyr, data.table, and dplyr). I've been learning how to use package dplyr, and came up with the following solution.
require(dplyr)
tab4 = tab1 %>%
group_by(id) %>% # group by id
mutate(value = value - min(value), value = value - lag(value)) %>% # group min to 0, difference lag 1
na.omit %>% # remove NA caused by lag 1 differencing
arrange(id, value) %>% # order by value within each id
mutate(time = 1:length(value)) %>% # Make a time variable from 1 to 5 based on current order
select(-year) # remove year column to match final OP output