I have the following data:
set.seed(789)
df_1 = data.frame(a = 22, b = 24, c = rnorm(10))
df_2 = data.frame(a = 44, b = 24, c = rnorm(10))
df_3 = data.frame(a = 33, b = 99, c = rnorm(10))
df_all = rbind(df_1, df_2, df_3)
I need to group df_all by columns a and b, and then find the 50th percentile (the median) of column c within each group.
This can be done individually, for each df, as follows:
df_1_q = quantile(df_1$c, probs = 0.50)
df_2_q = quantile(df_2$c, probs = 0.50)
df_3_q = quantile(df_3$c, probs = 0.50)
However, my real df_all is much larger than this.
More generally, how can I group a data.frame and apply a given function to each group?
You could use dplyr for that:
library(dplyr)
df_all %>%
  group_by(a, b) %>%
  summarise(quantile = quantile(c, probs = 0.5))
# A tibble: 3 x 3
# Groups:   a [?]
      a     b quantile
  <dbl> <dbl>    <dbl>
1    22    24   -0.268
2    33    99   -0.234
3    44    24   -0.445
Or using data.table:
library(data.table)
dt <- data.table(df_all)
dt[, list(quantile = quantile(c, probs = 0.5)), by = c("a", "b")]
    a  b   quantile
1: 22 24 -0.2679104
2: 44 24 -0.4450979
3: 33 99 -0.2336712
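For the more general question of grouping a data frame and applying an arbitrary function to each group, base R's aggregate() also works without any extra packages. A minimal sketch using df_all from above (any summary function can be swapped in for quantile):
# the formula interface groups by a and b and applies FUN to c;
# additional arguments such as probs are passed on to FUN
aggregate(c ~ a + b, data = df_all, FUN = quantile, probs = 0.5)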
I have a unique problem where I would like to add a column of percentiles for each group in a data frame. Here is what my data looks like:
library(tidyverse)
set.seed(123)
df <- tibble(id = 1:100,
             group = rep(letters[1:4], 25),
             x = c(sample(1:100, 25, replace = T),
                   sample(101:200, 25, replace = T),
                   sample(201:300, 25, replace = T),
                   sample(301:400, 25, replace = T)))
> df
# A tibble: 100 x 3
id group x
<int> <chr> <int>
1 1 a 78
2 2 b 80
3 3 c 7
4 4 d 100
5 5 a 45
6 6 b 76
7 7 c 25
8 8 d 91
9 9 a 13
10 10 b 84
# ... with 90 more rows
# Function to create a table of ten percentiles for a numeric vector
percentiles_table <- function(x) {
  res <- round(quantile(x, probs = seq(from = .1, to = 1, by = 0.1)), 0)
  res <- data.frame(percentile = names(res), to = res)
  res <- res %>%
    mutate(from = lag(to, default = 0)) %>%
    select(from, to, percentile)
}
# Table of percentiles
percentiles <- df %>%
  group_by(group) %>%
  summarise(percentiles_table(x)) %>%
  ungroup()
> percentiles
# A tibble: 40 x 4
group from to percentile
<chr> <dbl> <dbl> <chr>
1 a 0 25 10%
2 a 25 71 20%
3 a 71 106 30%
4 a 106 125 40%
5 a 125 198 50%
6 a 198 236 60%
7 a 236 278 70%
8 a 278 325 80%
9 a 325 379 90%
10 a 379 389 100%
I would like to add the percentile column to df for each group where the value of x falls between from and to.
There might be some way to calculate the percentile column directly, without having it calculated in a separate data.frame and then appending it back to df.
A one-liner with my santoku package:
library(santoku)
df |>
  group_by(group) |>
  mutate(
    percentile = chop_quantiles(x, 0:100/100,
                                labels = lbl_endpoint())
  )
# A tibble: 100 × 4
# Groups: group [4]
id group x percentile
<int> <chr> <int> <fct>
1 1 a 35 8%
2 2 b 97 20%
3 3 c 39 4%
4 4 d 20 8%
5 5 a 89 16%
...
Using data.table:
setDT(df)[
  ,
  percentile := cut(
    x,
    quantile(x, seq(0, 1, 0.1)),
    include.lowest = TRUE,
    labels = paste0(seq(10, 100, 10), "%")
  ),
  by = group
]
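A base-R-flavoured variant of the same idea uses findInterval() instead of cut(); it returns which decile bucket each value falls into, which maps straight onto a percentile label. A sketch (note that the edge handling at the break points differs slightly from cut() with include.lowest = TRUE):
library(dplyr)
df %>%
  group_by(group) %>%
  mutate(percentile = paste0(
    10 + 10 * findInterval(x, quantile(x, probs = seq(0.1, 0.9, by = 0.1))),
    "%")) %>%
  ungroup()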
install.packages("zoo")
library(zoo)
# build a lookup of every possible x value (0..max "to") crossed with every group;
# the auto-generated column name "c(0:max(percentiles$to))" is reused in the merges below
y <- as.data.frame(c(0:max(percentiles$to)))
y <- merge(y, unique(percentiles[, c(1)]))
# attach the percentile label at each "from" boundary and carry it forward with na.locf()
y <- merge(y, percentiles[, c(1, 2, 4)],
           by.x = c("group", "c(0:max(percentiles$to))"),
           by.y = c("group", "from"), all.x = TRUE)
y <- na.locf(y)
# join the lookup back onto df by group and x
df <- merge(df, y, all.x = TRUE, by.x = c("group", "x"),
            by.y = c("group", "c(0:max(percentiles$to))"))
I got this working solution.
percentile_ranks <- function(x) {
  res <- trunc(rank(x)) / length(x) * 100
  res <- floor(res / 10)
}
df <- df %>%
  group_by(group) %>%
  arrange(x) %>%
  mutate(percentile = percentile_ranks(x)) %>%
  mutate(percentile_pct = paste0(percentile * 10, "%")) %>%
  ungroup() %>%
  arrange(id) # original data.frame order
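A quick sanity check of the helper on a simple vector where the deciles are obvious (the value is returned invisibly because the function ends in an assignment, so it is stored before printing):
pr <- percentile_ranks(1:10)
pr
# [1]  1  2  3  4  5  6  7  8  9 10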
I need help trying to make a dataset which records which treatment each participant is on and what they scored in a composite test (this is just an exercise for my course; no real data is used).
A <- c(36, 35, 22, 20)
B <- c(26, 30, 25, 20)
C <- c(42, 30, 45, 62)
treatment <- c("A", "B", "C")
depression <- c(A, B, C)
df1 <- data.frame(treatment, depression)
This arranges the data in the data frame incorrectly, with the treatment column cycling A, B, C, A, ... instead of A, A, A, A, B, ...
Does anyone know how to convert and rearrange this data?
I need the data arranged so that I can split it and do different calculations on each group.
treatment <- rep(LETTERS[1:3], each=4)
depression <- c(A, B, C)
df1 <- data.frame(treatment, depression)
I think you're looking for
df1 <- data.frame(treatment = rep(treatment, each = 4), depression)
For production/"real-life" code you would probably want to do something fancier, e.g.
L <- tibble::lst(A, B, C) ## self-naming list
data.frame(treatment = rep(names(L), lengths(L)),
           depression = unlist(L))
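A similar base R shortcut is stack(), which also turns a named list into a two-column data frame (a sketch; note the columns come back named values and ind rather than depression and treatment):
stack(list(A = A, B = B, C = C))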
Here is a tidyverse approach:
library(tidyverse)
tibble(depression) %>%
  mutate(treatment = rep(treatment, each = length(A)))
   depression treatment
        <dbl> <chr>
 1         36 A
 2         35 A
 3         22 A
 4         20 A
 5         26 B
 6         30 B
 7         25 B
 8         20 B
 9         42 C
10         30 C
11         45 C
12         62 C
I have two data frames. df_sub is a subset of the main data frame, df. I want to take a subset of df based on df_sub, where the resulting data frame is df_sub plus the observations that occur immediately before and after it.
As an example, consider the two data sets
df <- data.frame(var1 = c("a", "x", "x", "y", "z", "t"),
                 var2 = c(4, 1, 2, 45, 56, 89))
df_sub <- data.frame(var1 = c("x", "y"),
                     var2 = c(2, 45))
They look like
> df
  var1 var2
1    a    4
2    x    1
3    x    2
4    y   45
5    z   56
6    t   89
> df_sub
  var1 var2
1    x    2
2    y   45
The result I want would be
> df_result
  var1 var2
2    x    1
3    x    2
4    y   45
5    z   56
I was thinking of using an inner_join or something similar.
We could use match to get the indices, then add or subtract 1 from those indices, take the unique values, and subset the rows:
v1 <- na.omit(match(do.call(paste, df_sub), do.call(paste, df)) )
df[unique(v1 + rep(c(-1, 0, 1), each = length(v1))),]
Output:
  var1 var2
2    x    1
3    x    2
4    y   45
5    z   56
Or create a 'flag' column in the 'df_sub', do a left_join, and then filter based on the lead/lag values of 'flag'
library(dplyr)
df %>%
  left_join(df_sub %>%
              mutate(flag = TRUE)) %>%
  filter(flag | lag(flag) | lead(flag)) %>%
  select(-flag)
  var1 var2
1    x    1
2    x    2
3    y   45
4    z   56
You can create a row number to keep track of the rows that are selected via the join, then subset the data by including the minimum row number - 1 and the maximum row number + 1.
library(dplyr)
tmp <- df %>%
  mutate(row = row_number()) %>%
  inner_join(df_sub, by = c("var1", "var2"))

df[c(min(tmp$row) - 1, tmp$row, max(tmp$row) + 1), ]
#  var1 var2
#2    x    1
#3    x    2
#4    y   45
#5    z   56
I have two data frames:
require(tidyverse)
set.seed(42)
df1 = data_frame(x = c(4, 3), y = c(0, 0), z = c(NA, 3))
df2 = data_frame(x = sample(1:4, 100, replace = T),
                 y = sample(c(-3, 0, 3), 100, replace = T),
                 z = c(NA, NA, rep(3, 98))) %>%
  mutate(Tracking = row_number())
I would like, separately for each row of df1 and for each column of df1, to find the indices of df2 at which df2 equals df1. If I tried a loop, each iteration would look like:
L <- list()
for (i in 1:nrow(df1)) {
  for (j in 1:ncol(df1)) {
    L[[i]][[j]] <- inner_join(df1[i, j], df2)
  }
}
For example, the first element of the list is:
inner_join(df1[1,1], df2)
Joining, by = "x"
# A tibble: 26 x 4
x y z Tracking
<dbl> <dbl> <dbl> <int>
1 4. 0. NA 1
2 4. -3. NA 2
3 4. 0. 3. 4
4 4. 3. 3. 13
5 4. 0. 3. 16
6 4. -3. 3. 17
7 4. 0. 3. 21
8 4. 0. 3. 23
9 4. 0. 3. 24
10 4. 3. 3. 28
# ... with 16 more rows
However, I am sure there's a more efficient way to do this, possibly with dplyr + purrr. I don't have much experience with purrr, but I have a feeling the map function could come in handy; I just don't know how to call the columns separately.
You could do something like
L <- map(names(df1),
         function(.) {
           out <- inner_join(x = df1[, ., drop = FALSE],
                             y = df2,
                             by = .)
           split(out, out[[.]])
         })
but I'm not sure if this is better or more efficient than the for loop you started with.
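As a hypothetical follow-up (assuming the L built above), naming the outer list after the columns of df1 makes the result easier to index, e.g. pulling all df2 rows that matched x == 4:
L <- purrr::set_names(L, names(df1))  # outer list named "x", "y", "z"
L$x[["4"]]                            # joined rows of df2 where x == 4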
I have a data frame with two groups and their values. I need to find the max value within each group (column group) and discover which value of the second column (dist) that max corresponds to.
# example
df <- data.frame(group = rep(c("a", "b"), each = 5),
                 val = 1:10,
                 dist = rep(c("NR", "b1"), 5))
> df
group val dist
1 a 1 NR
2 a 2 b1
3 a 3 NR
4 a 4 b1
5 a 5 NR
6 b 6 b1
7 b 7 NR
8 b 8 b1
9 b 9 NR
10 b 10 b1
I can get the max values by group:
aggregate(val ~ group, df, max)
group val
1 a 5
2 b 10
or by tapply:
tapply(df$val, df$group, max)
but I also need to know in which "dist" the max is located. The desired output is:
group val dist
1 a 5 NR
2 b 10 b1
How can I accomplish this?
We can slice the row which has the max 'val' for each 'group':
library(dplyr)
df %>%
  group_by(group) %>%
  slice(which.max(val))
If there are ties for the max value, then do a comparison and filter the rows:
df %>%
  group_by(group) %>%
  filter(val == max(val))
Or with ave from base R:
df[with(df, val == ave(val, group, FUN= max)),]
# group val dist
#5 a 5 NR
#10 b 10 b1
df <- data.frame(group = rep(c("a", "b"), each = 5),
                 val = 1:10,
                 dist = rep(c("NR", "b1"), 5))
# split by group, keep the max-val row in each piece, then stack the pieces back together
df1 <- split(df, df$group)
df2 <- lapply(df1, function(i) i[which(i$val == max(i$val)), ])
df3 <- do.call(rbind, df2)