I am trying to aggregate a grand mean from mean scores for students. Here is what my dataset looks like:
id <- c(1,1,1, 2,2,2, 3,3, 4,4,4)
mean <- c(5,5,5, 6,6,6, 7,7, 8,8,8)
data <- data.frame(id,mean)
> data
id mean
1 1 5
2 1 5
3 1 5
4 2 6
5 2 6
6 2 6
7 3 7
8 3 7
9 4 8
10 4 8
11 4 8
I am using the dplyr package for this calculation. I use this:
data %>%
mutate(grand.mean = mean(mean))
id mean grand.mean
1 1 5 6.454545
2 1 5 6.454545
3 1 5 6.454545
4 2 6 6.454545
5 2 6 6.454545
6 2 6 6.454545
7 3 7 6.454545
8 3 7 6.454545
9 4 8 6.454545
10 4 8 6.454545
11 4 8 6.454545
However, this does not account for the repeated means within each id. The calculation should take the unique mean from each id and average those:
(5+6+7+8)/4 = 6.5 instead of 6.45.
Any ideas?
Thanks!
If there are duplicates for 'mean' in different 'id's, use match to get the position of the first occurrence of each 'id' and take the mean of the 'mean' column at those positions:
library(dplyr)
data %>%
mutate(grand.mean = mean(mean[match(unique(id), id)]))
# id mean grand.mean
#1 1 5 6.5
#2 1 5 6.5
#3 1 5 6.5
#4 2 6 6.5
#5 2 6 6.5
#6 2 6 6.5
#7 3 7 6.5
#8 3 7 6.5
#9 4 8 6.5
#10 4 8 6.5
#11 4 8 6.5
Or another option is duplicated
data %>%
mutate(grand.mean = mean(mean[!duplicated(id)]))
Or take the distinct rows of 'id' and 'mean', get the mean, and bind the columns with the original dataset:
library(tidyr)
data %>%
distinct(id, mean) %>%
summarise(grand.mean = mean(mean)) %>%
uncount(nrow(data)) %>%
bind_cols(data, .)
A base R one-liner could be:
mean(tapply(data$mean, data$id, '[', 1))
#[1] 6.5
To put the result in the original data set do
data$grand.mean <- mean(tapply(data$mean, data$id, '[', 1))
You can use unique and then calculate the mean to get the grand mean.
mean(unique(data)[,"mean"])
#[1] 6.5
Or you can aggregate by 'id' and then calculate the mean to get the grand mean.
mean(aggregate(mean~id, data, base::mean)[,"mean"])
#[1] 6.5
Or use ave to get the number of repeated values per id and use its inverse as a weight in weighted.mean.
weighted.mean(data$mean, 1/ave(data$id, data$id, FUN = length))
#[1] 6.5
If you only need a single value for the grand mean, just use two 'summarise' steps with 'dplyr':
library(dplyr)
data %>%
group_by(id) %>%
summarise(mean = mean(mean)) %>%
summarise(grand.mean = mean(mean))
Result:
grand.mean
<dbl>
1 6.5
Using dplyr, we can group_by id and take the mean of the unique mean values in each id, then compute the grand mean of the entire dataset and right_join with the original data to add grand_mean as a new column.
library(dplyr)
data %>%
group_by(id) %>%
summarise(grand_mean = mean(unique(mean))) %>%
mutate(grand_mean = mean(grand_mean)) %>%
right_join(data, by = 'id')
# A tibble: 11 x 3
# id grand_mean mean
# <dbl> <dbl> <dbl>
# 1 1 6.5 5
# 2 1 6.5 5
# 3 1 6.5 5
# 4 2 6.5 6
# 5 2 6.5 6
# 6 2 6.5 6
# 7 3 6.5 7
# 8 3 6.5 7
# 9 4 6.5 8
#10 4 6.5 8
#11 4 6.5 8
I am trying to create consecutive ID numbers for each distinct study. I found an example dataset where such an ID number was created under the esid variable:
dat <- dat.assink2016
head(dat, 9)
study esid id yi vi pubstatus year deltype
1 1 1 1 0.9066 0.0740 1 4.5 general
2 1 2 2 0.4295 0.0398 1 4.5 general
3 1 3 3 0.2679 0.0481 1 4.5 general
4 1 4 4 0.2078 0.0239 1 4.5 general
5 1 5 5 0.0526 0.0331 1 4.5 general
6 1 6 6 -0.0507 0.0886 1 4.5 general
7 2 1 7 0.5117 0.0115 1 1.5 general
8 2 2 8 0.4738 0.0076 1 1.5 general
9 2 3 9 0.3544 0.0065 1 1.5 general
I would like to create the same for my study; can anyone show me how to do it?
The key is to group_by study, then use row_number
library(dplyr)
df %>%
group_by(study) %>%
mutate(esid = row_number())
With the example data from @njp:
# A tibble: 9 × 3
# Groups: study [3]
study id esid
<dbl> <int> <int>
1 1 1 1
2 1 2 2
3 1 3 3
4 2 4 1
5 2 5 2
6 2 6 3
7 2 7 4
8 3 8 1
9 3 9 2
If the id column is consecutive (i.e. no jumps or repeated values) you could subtract the minimum value of id for each study and add one:
# Example data
df = data.frame(study=c(1,1,1,2,2,2,2,3,3),
id=1:9)
# Calculate minima
min.id = tapply(X=df$id,
INDEX=df$study,
FUN=min)
# merge this with the data (index by name so it also works
# when the study labels are not consecutive integers)
df$min.id = min.id[as.character(df$study)]
# Calculate consecutive id as required
df$esid = df$id - df$min.id+1
I have a data frame in this format:

Group  Observation
a      1
a      2
a      3
b      4
b      5
c      6
c      7
c      8
I want to create a unique ID column which considers both the group and each unique observation within it, so that it is formatted like so:

Group  Observation  Unique_ID
a      1            1.1
a      2            1.2
a      3            1.3
b      4            2.1
b      5            2.2
c      6            3.1
c      7            3.2
c      8            3.3
Does anyone know of any syntax or functions to accomplish this? The formatting does not need to match '1.1' exactly, as long as it signifies the group and each unique observation within it. Thanks in advance!
Another way, using cur_group_id and row_number:
library(dplyr)
A <- 'Group Observation
a 1
a 2
a 3
b 4
b 5
c 6
c 7
c 8'
df <- read.table(textConnection(A), header = TRUE)
df |>
group_by(Group) |>
mutate(Unique_ID = paste0(cur_group_id(), ".", row_number())) |>
ungroup()
Group Observation Unique_ID
<chr> <int> <chr>
1 a 1 1.1
2 a 2 1.2
3 a 3 1.3
4 b 4 2.1
5 b 5 2.2
6 c 6 3.1
7 c 7 3.2
8 c 8 3.3
library(tidyverse)
df <- read_table("Group Observation
a 1
a 2
a 3
b 4
b 5
c 6
c 7
c 8")
df %>%
mutate(unique = Group %>%
as.factor() %>%
as.integer() %>%
paste(., Observation, sep = "."))
#> # A tibble: 8 x 3
#> Group Observation unique
#> <chr> <dbl> <chr>
#> 1 a 1 1.1
#> 2 a 2 1.2
#> 3 a 3 1.3
#> 4 b 4 2.4
#> 5 b 5 2.5
#> 6 c 6 3.6
#> 7 c 7 3.7
#> 8 c 8 3.8
Created on 2022-07-12 by the reprex package (v2.0.1)
Try this
df |> group_by(Group) |>
mutate(Unique_ID = paste0(cur_group_id(),".",1:n()))
Output:
# A tibble: 8 × 3
# Groups: Group [3]
Group Observation Unique_ID
<chr> <int> <chr>
1 a 1 1.1
2 a 2 1.2
3 a 3 1.3
4 b 4 2.1
5 b 5 2.2
6 c 6 3.1
7 c 7 3.2
8 c 8 3.3
I have a data.frame that looks like this
data=data.frame(group=c("A","B","C","A","B","C","A","B","C"),
time= c(rep(1,3),rep(2,3), rep(3,3)),
value=c(0,1,1,0.1,10,20,10,20,30))
group time value
1 A 1 0.0
2 B 1 1.0
3 C 1 1.0
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
I would like to find an elegant way to drop a group when its values are smaller than 0.2 at two different time points. The time points do not have to be consecutive.
In this case, I would like to filter out group A, because its values at time points 1 and 2 are smaller than 0.2.
group time value
1 B 1 1.0
2 C 1 1.0
3 B 2 10.0
4 C 2 20.0
5 B 3 20.0
6 C 3 30.0
With this solution you check that no group has more than one observation with a value under 0.2, as requested:
library(dplyr)
data %>%
group_by(group) %>%
filter(sum(value < 0.2) < 2) %>%
ungroup()
#> # A tibble: 6 x 3
#> group time value
#> <chr> <dbl> <dbl>
#> 1 B 1 1
#> 2 C 1 1
#> 3 B 2 10
#> 4 C 2 20
#> 5 B 3 20
#> 6 C 3 30
But if you are really a fan of base R:
data[ave(data$value<0.2, data$group, FUN = function(x) sum(x)<2), ]
#> group time value
#> 2 B 1 1
#> 3 C 1 1
#> 5 B 2 10
#> 6 C 2 20
#> 8 B 3 20
#> 9 C 3 30
Try this dplyr approach:
library(tidyverse)
#Code
data <- data %>%
  group_by(group) %>%
  mutate(Flag = any(value < 0.2)) %>%
  filter(Flag == FALSE) %>%
  select(-Flag)
Note that any() flags a group with even a single value below 0.2; for the strict "two different time points" condition, use sum(value < 0.2) >= 2 as the flag instead.
Output:
# A tibble: 6 x 3
# Groups: group [2]
group time value
<fct> <dbl> <dbl>
1 B 1 1
2 C 1 1
3 B 2 10
4 C 2 20
5 B 3 20
6 C 3 30
I want to add a custom column to a data frame, grouped by level, which will be the mean of two variables. When there is missing data, I need a masking sign, like "--". Example:
df <- data.frame(level= c(1,2,3,4,5,6,7,8,8,6,7,5,4,2), var1=c(1,1,2,3,4,5,6,7,8,8,6,7,5,4), var2 = c(2,NA,1,2,3,4,5,6,7,8,8,6,7,5))
doesn't work:
df %>%
group_by(level) %>%
mutate(result = ifelse(is.na(var1) | is.na(var2), "--", mean(c(var1,var2))))
doesn't work:
df %>%
group_by(level) %>%
mutate(result = ifelse(!(is.na(var1) | is.na(var2)), mean(c(var1,var2)), "--" ))
doesn't give an error:
df %>%
group_by(level) %>%
mutate(result = ifelse(is.na(var1) | is.na(var2), mean(c(var1,var2)), "--" ))
The error I get in the first two cases is:
Error in mutate_impl(.data, dots) :
Column `result` can't be converted from numeric to character
Can you tell me what I am missing and how mutate works, so I can actually obtain what I need?
Thanks!
From the ifelse documentation:
ifelse(test, yes, no)
ifelse returns a vector of the same length and attributes (including
dimensions and "class") as test and data values from the values of yes
or no. The mode of the answer will be coerced from logical to
accommodate first any values taken from yes and then any values taken
from no
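For instance, a quick toy example (my own illustration, not from the original answer) makes the coercion visible:

```r
# ifelse() builds its result from the logical `test` vector, then fills in
# values from `yes` and `no`, coercing the result to the richest mode present.
x <- c(1, NA, 3)
ifelse(is.na(x), "--", x)
#> [1] "1"  "--" "3"
```

Because "--" is a character, every numeric value taken from x is coerced to character as well.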
Basically, you can't mix characters and numbers for the yes/no values. It is not a good idea to mix characters and numbers in the same variable anyway; consider using NA_real_ instead of "--". If you must do it your way, you can try as.character(mean(c(var1, var2))), but then your means are returned as characters.
df %>%
group_by(level) %>%
mutate(result = ifelse(is.na(var1) | is.na(var2), "--", as.character(mean(c(var1,var2)))))
# A tibble: 14 x 4
# Groups: level [8]
level var1 var2 result
<dbl> <dbl> <dbl> <chr>
1 1 1 2 1.5
2 2 1 NA --
3 3 2 1 1.5
4 4 3 2 4.25
5 5 4 3 5
6 6 5 4 6.25
7 7 6 5 6.25
8 8 7 6 7
9 8 8 7 7
10 6 8 8 6.25
11 7 6 8 6.25
12 5 7 6 5
13 4 5 7 4.25
14 2 4 5 NA
Note: you can use write.csv(df, "report.csv", na = "--") if you only want to replace NA with "--" in your report.
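If you would rather keep result numeric, here is a minimal sketch of the NA_real_ alternative suggested above (same df as in the question, assuming dplyr is loaded):

```r
library(dplyr)

df <- data.frame(
  level = c(1, 2, 3, 4, 5, 6, 7, 8, 8, 6, 7, 5, 4, 2),
  var1  = c(1, 1, 2, 3, 4, 5, 6, 7, 8, 8, 6, 7, 5, 4),
  var2  = c(2, NA, 1, 2, 3, 4, 5, 6, 7, 8, 8, 6, 7, 5)
)

df %>%
  group_by(level) %>%
  mutate(result = ifelse(is.na(var1) | is.na(var2),
                         NA_real_,              # numeric NA keeps the column numeric
                         mean(c(var1, var2))))
```

Both branches are numeric, so result stays a <dbl> column, and the NAs can still be masked as "--" at export time with write.csv(..., na = "--").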
We can use case_when
df %>%
group_by(level) %>%
mutate(result = case_when(is.na(var1)|is.na(var2) ~ "--",
TRUE ~ as.character(mean(c(var1, var2)))))
# A tibble: 14 x 4
# Groups: level [8]
# level var1 var2 result
# <dbl> <dbl> <dbl> <chr>
# 1 1 1 2 1.5
# 2 2 1 NA --
# 3 3 2 1 1.5
# 4 4 3 2 4.25
# 5 5 4 3 5
# 6 6 5 4 6.25
# 7 7 6 5 6.25
# 8 8 7 6 7
# 9 8 8 7 7
#10 6 8 8 6.25
#11 7 6 8 6.25
#12 5 7 6 5
#13 4 5 7 4.25
#14 2 4 5 <NA>
Suppose I have a dataset like this:
df <- data.frame(group = c(rep(1,3),rep(2,2), rep(3,2),rep(4,3),rep(5, 2)), score = c(30, 10, 22, 44, 6, 5, 20, 35, 2, 60, 14,5))
group score
1 1 30
2 1 10
3 1 22
4 2 44
5 2 6
6 3 5
7 3 20
8 4 35
9 4 2
10 4 60
11 5 14
12 5 5
I want to remove the first row of each group; the expected output should look like this:
group score
1 1 10
2 1 22
3 2 6
4 3 20
5 4 2
6 4 60
7 5 5
Is there a simple way to do this?
An option with dplyr is to select all rows except the first one in each group:
library(dplyr)
df %>%
group_by(group) %>%
slice(2:n())
# group score
# <dbl> <dbl>
#1 1.00 10.0
#2 1.00 22.0
#3 2.00 6.00
#4 3.00 20.0
#5 4.00 2.00
#6 4.00 60.0
#7 5.00 5.00
Another way is shown by @Rich Scriven in a now-deleted answer:
df %>%
group_by(group) %>%
slice(-1)
Quite simple with duplicated
df[duplicated(df$group),]
group score
2 1 10
3 1 22
5 2 6
7 3 20
9 4 2
10 4 60
12 5 5
Another base R option would be to compare adjacent elements:
df[c(FALSE,df$group[-1]==df$group[-nrow(df)]),]
# group score
#2 1 10
#3 1 22
#5 2 6
#7 3 20
#9 4 2
#10 4 60
#12 5 5
Here we removed the first observation of 'group' (df$group[-1]) and compared it (==) with the vector from which the last observation was removed (df$group[-nrow(df)]). As the comparison is one element shorter than the number of rows of the dataset, we pad with FALSE at the top and use the result as a logical index to subset the dataset.
dplyr::filter(df, group == lag(group))
group score
1 1 10
2 1 22
3 2 6
4 3 20
5 4 2
6 4 60
7 5 5
See lead and lag of package dplyr for more information:
https://dplyr.tidyverse.org/reference/lead-lag.html
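One caveat worth noting (my addition, not from the original answer): lag() yields NA for the first row, and filter() drops rows where the condition evaluates to NA, which is what removes each group's first row; this relies on the rows being sorted by group. A toy illustration:

```r
library(dplyr)

# dplyr::lag() shifts the vector down by one, padding with NA at the top,
# so the first row of the data (and of each sorted group) compares to NA.
lag(c(1, 1, 2, 2, 2))
#> [1] NA  1  1  2  2
```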