Adding unique ID column associated to two groups in R [duplicate]

This question already has answers here:
How to create a consecutive group number
(13 answers)
Closed 7 months ago.
I have a data frame in this format:
Group Observation
a 1
a 2
a 3
b 4
b 5
c 6
c 7
c 8
I want to create a unique ID column which considers both group and each unique observation within it, so that it is formatted like so:
Group Observation Unique_ID
a 1 1.1
a 2 1.2
a 3 1.3
b 4 2.1
b 5 2.2
c 6 3.1
c 7 3.2
c 8 3.3
Does anyone know of any syntax or functions to accomplish this? The formatting does not need to exactly match '1.1' as long as it signifies group and each unique observation within it. Thanks in advance

Another way using cur_group_id and row_number
library(dplyr)
A <- 'Group Observation
a 1
a 2
a 3
b 4
b 5
c 6
c 7
c 8'
df <- read.table(textConnection(A), header = TRUE)
df |>
group_by(Group) |>
mutate(Unique_ID = paste0(cur_group_id(), ".", row_number())) |>
ungroup()
Group Observation Unique_ID
<chr> <int> <chr>
1 a 1 1.1
2 a 2 1.2
3 a 3 1.3
4 b 4 2.1
5 b 5 2.2
6 c 6 3.1
7 c 7 3.2
8 c 8 3.3

library(tidyverse)
df <- read_table("Group Observation
a 1
a 2
a 3
b 4
b 5
c 6
c 7
c 8")
df %>%
mutate(unique = Group %>%
as.factor() %>%
as.integer() %>%
paste(., Observation, sep = "."))
#> # A tibble: 8 x 3
#> Group Observation unique
#> <chr> <dbl> <chr>
#> 1 a 1 1.1
#> 2 a 2 1.2
#> 3 a 3 1.3
#> 4 b 4 2.4
#> 5 b 5 2.5
#> 6 c 6 3.6
#> 7 c 7 3.7
#> 8 c 8 3.8
Created on 2022-07-12 by the reprex package (v2.0.1)

Try this
df |> group_by(Group) |>
mutate(Unique_ID = paste0(cur_group_id(),".",1:n()))
output
# A tibble: 8 × 3
# Groups: Group [3]
Group Observation Unique_ID
<chr> <int> <chr>
1 a 1 1.1
2 a 2 1.2
3 a 3 1.3
4 b 4 2.1
5 b 5 2.2
6 c 6 3.1
7 c 7 3.2
8 c 8 3.3
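For comparison, here is a minimal base R sketch of the same idea without dplyr. It assumes the rows are already in the order you want within each group. match() against unique() numbers the groups by first appearance, and ave() with seq_along counts rows within each group. The data frame is rebuilt here only so the snippet runs on its own.

```r
# Sample data from the question
df <- data.frame(
  Group = c("a", "a", "a", "b", "b", "c", "c", "c"),
  Observation = 1:8
)

# Group number: position of each group among the unique group labels
group_id <- match(df$Group, unique(df$Group))

# Within-group counter: 1, 2, 3, ... restarting for every group
within_id <- ave(seq_along(df$Group), df$Group, FUN = seq_along)

# Combine both parts into a character ID such as "2.1"
df$Unique_ID <- paste0(group_id, ".", within_id)
print(df)
```

The ID is kept as a character string on purpose: converted to numeric, "1.1" and "1.10" would collapse to the same value.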

Related

Calculating means for multiple groups in R

Hello fellow overflowers,
currently I'm trying to calculate means for multiple groups.
My df looks like this (~600 rows):
col1 col2 col3 col4 col5
<type> <gender> <var1> <var2> <var3>
1 A 1 3 2 3
2 A 2 NA 5 NA
3 A 1 3 3 5
4 B 1 4 NA 1
5 B 2 3 4 5
Now the result should look like this:
col1 col2 col3 col4 col5
<type> <gender> <mean-var1> <mean-var2> <mean-var3>
1 A 1 3.6 4.1 4.6
2 A 2 4.1 3.8 4.2
3 B 1 3.9 4.2 3.7
4 B 2 4.3 3.2 2.7
5 C 1 3.5 4.5 3.6
6 C 2 4 3.7 4.2
...
So far, I've tried to use the group_by function:
avg_values<-data%>%
group_by(type, gender) %>%
summarize_all (mean())
So far, it didn't work out. Could you help me figure out a good way to handle this?
Does this work?
library(dplyr)
df %>% group_by(type, gender) %>% summarise(across(var1:var3, ~ mean(.x, na.rm = TRUE)))
`summarise()` regrouping output by 'type' (override with `.groups` argument)
# A tibble: 4 x 5
# Groups: type [2]
type gender var1 var2 var3
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1 3 2.5 4
2 A 2 NaN 5 NaN
3 B 1 4 NaN 1
4 B 2 3 4 5
Data used:
df
# A tibble: 5 x 5
type gender var1 var2 var3
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1 3 2 3
2 A 2 NA 5 NA
3 A 1 3 3 5
4 B 1 4 NA 1
5 B 2 3 4 5
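The attempt in the question fails because mean() is called on the spot with no arguments, whereas summarize_all expects the function itself (mean, with na.rm = TRUE passed along). As a hedged alternative, the sketch below does the same grouped mean in base R with aggregate(). The column names var1 to var3 follow the answer above, and the five rows are the sample data only.

```r
# Sample data as printed in the answer above
df <- data.frame(
  type   = c("A", "A", "A", "B", "B"),
  gender = c(1, 2, 1, 1, 2),
  var1   = c(3, NA, 3, 4, 3),
  var2   = c(2, 5, 3, NA, 4),
  var3   = c(3, NA, 5, 1, 5)
)

# Mean of every value column per type/gender combination.
# na.rm = TRUE is forwarded to mean(); a group with only NA yields NaN.
res <- aggregate(
  df[c("var1", "var2", "var3")],
  by  = list(type = df$type, gender = df$gender),
  FUN = mean,
  na.rm = TRUE
)
print(res)
```

The data-frame method is used here rather than the formula interface because the formula interface drops any row containing an NA by default, which would change the per-column means.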

Erase groups based on a condition with dplyr [duplicate]

This question already has answers here:
Filter group of rows based on sum of values from different column
(2 answers)
Closed 2 years ago.
I have a data.frame that looks like this
data=data.frame(group=c("A","B","C","A","B","C","A","B","C"),
time= c(rep(1,3),rep(2,3), rep(3,3)),
value=c(0,1,1,0.1,10,20,10,20,30))
group time value
1 A 1 0.0
2 B 1 1.0
3 C 1 1.0
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
I would like to find an elegant way to erase a group when its value is below 0.2 at two different time points. Those points do not have to be consecutive.
In this case, I would like to filter out group A because its value at time point 1 and at time point 2 is below 0.2.
group time value
1 B 1 1.0
2 C 1 1.0
3 B 2 10.0
4 C 2 20.0
5 B 3 20.0
6 C 3 30.0
This solution keeps only the groups that have fewer than two observations with a value under 0.2, as you requested.
library(dplyr)
data %>%
group_by(group) %>%
filter(sum(value < 0.2) < 2) %>%
ungroup()
#> # A tibble: 6 x 3
#> group time value
#> <chr> <dbl> <dbl>
#> 1 B 1 1
#> 2 C 1 1
#> 3 B 2 10
#> 4 C 2 20
#> 5 B 3 20
#> 6 C 3 30
But if you are really a fan of base R:
data[ave(data$value<0.2, data$group, FUN = function(x) sum(x)<2), ]
#> group time value
#> 2 B 1 1
#> 3 C 1 1
#> 5 B 2 10
#> 6 C 2 20
#> 8 B 3 20
#> 9 C 3 30
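Try this dplyr approach. Note that it drops a group as soon as any single value is below 0.2, which is stricter than the two-time-point rule in the question but gives the same result on this sample data:
library(tidyverse)
#Code
data <- data %>% group_by(group) %>% mutate(Flag=any(value<0.2)) %>%
filter(!Flag) %>% select(-Flag)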
Output:
# A tibble: 6 x 3
# Groups: group [2]
group time value
<fct> <dbl> <dbl>
1 B 1 1
2 C 1 1
3 B 2 10
4 C 2 20
5 B 3 20
6 C 3 30
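The answers above apply two slightly different rules: counting low values per group versus checking whether any low value exists. The base R sketch below makes the difference visible on a small, made-up data set (not the data from the question) in which group B has exactly one value under 0.2.

```r
# Hypothetical data: A has two low values, B has exactly one
data <- data.frame(
  group = c("A", "B", "A", "B", "A", "B"),
  time  = c(1, 1, 2, 2, 3, 3),
  value = c(0, 0.1, 0.1, 10, 10, 20)
)

# Number of values below 0.2 in each row's group
low_count <- ave(as.integer(data$value < 0.2), data$group, FUN = sum)

# Rule from the question: drop a group with two or more low values
keep_count <- data[low_count < 2, ]

# Stricter rule: drop a group with any low value at all
keep_any <- data[low_count == 0, ]

print(keep_count)  # group B survives
print(keep_any)    # no rows left
```

If each time point could appear more than once per group, counting distinct time points rather than rows would be closer to the wording of the question.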

calculate grand mean from means in r

I am trying to compute a grand mean from the mean scores of students. Here is what my dataset looks like:
id <- c(1,1,1, 2,2,2, 3,3, 4,4,4)
mean <- c(5,5,5, 6,6,6, 7,7, 8,8,8)
data <- data.frame(id,mean)
> data
id mean
1 1 5
2 1 5
3 1 5
4 2 6
5 2 6
6 2 6
7 3 7
8 3 7
9 4 8
10 4 8
11 4 8
I am using the dplyr package for this calculation. Here is what I tried:
data %>%
mutate(grand.mean = mean(mean))
id mean grand.mean
1 1 5 6.454545
2 1 5 6.454545
3 1 5 6.454545
4 2 6 6.454545
5 2 6 6.454545
6 2 6 6.454545
7 3 7 6.454545
8 3 7 6.454545
9 4 8 6.454545
10 4 8 6.454545
11 4 8 6.454545
However, this does not account for the mean being repeated on every row of an id. The calculation should take one mean per id and average those,
so the result is (5+6+7+8)/4 = 6.5 instead of 6.45.
Any ideas?
Thanks!
If the same mean value can occur under different 'id's, use match to get the position of the first row of each 'id' and take the mean of the 'mean' column at those positions
library(dplyr)
data %>%
mutate(grand.mean = mean(mean[match(unique(id), id)]))
# id mean grand.mean
#1 1 5 6.5
#2 1 5 6.5
#3 1 5 6.5
#4 2 6 6.5
#5 2 6 6.5
#6 2 6 6.5
#7 3 7 6.5
#8 3 7 6.5
#9 4 8 6.5
#10 4 8 6.5
#11 4 8 6.5
Or another option is duplicated
data %>%
mutate(grand.mean = mean(mean[!duplicated(id)]))
Or take the distinct rows of 'id' and 'mean', get the mean, and bind the result to the original dataset as a column
library(tidyr)
data %>%
distinct(id, mean) %>%
summarise(grand.mean = mean(mean)) %>%
uncount(nrow(data)) %>%
bind_cols(data, .)
A base R one-liner could be:
mean(tapply(data$mean, data$id, '[', 1))
#[1] 6.5
To put the result in the original data set do
data$grand.mean <- mean(tapply(data$mean, data$id, '[', 1))
You can use unique and then calculate the mean to get a grand mean.
mean(unique(data)[,"mean"])
#[1] 6.5
Or you can aggregate by id and then calculate the mean to get a grand mean.
mean(aggregate(mean~id, data, base::mean)[,"mean"])
#[1] 6.5
Or use ave to get the number of repeated values per id and use its inverse as a weight in weighted.mean.
weighted.mean(mean, 1/ave(id, id, FUN=length))
#[1] 6.5
If you only need a single answer for the grand mean, just use two 'summarise' steps with 'dplyr':
library(dplyr)
data %>%
group_by(id) %>%
summarise(mean = mean(mean)) %>%
summarise(grand.mean = mean(mean))
Result:
grand.mean
<dbl>
1 6.5
Using dplyr, we can group_by id and get the mean of unique mean values in each id, then get the grand_mean of the entire dataset and do a right_join with the original data to add grand_mean as a new column.
library(dplyr)
data %>%
group_by(id) %>%
summarise(grand_mean = mean(unique(mean))) %>%
mutate(grand_mean = mean(grand_mean)) %>%
right_join(data, by = 'id')
# A tibble: 11 x 3
# id grand_mean mean
# <dbl> <dbl> <dbl>
# 1 1 6.5 5
# 2 1 6.5 5
# 3 1 6.5 5
# 4 2 6.5 6
# 5 2 6.5 6
# 6 2 6.5 6
# 7 3 6.5 7
# 8 3 6.5 7
# 9 4 6.5 8
#10 4 6.5 8
#11 4 6.5 8
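To see where the two numbers come from, the arithmetic can be checked in a few lines of base R. This is only a sketch: the score column is named score here (an assumption, to avoid reusing the name mean), and the values are those from the question.

```r
id    <- c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4)
score <- c(5, 5, 5, 6, 6, 6, 7, 7, 8, 8, 8)

# Averaging all 11 rows weights each id by its number of rows:
# (3*5 + 3*6 + 2*7 + 3*8) / 11 = 71 / 11
row_mean <- mean(score)

# Averaging one value per id weights every id equally:
# (5 + 6 + 7 + 8) / 4 = 26 / 4
per_id     <- tapply(score, id, mean)
grand_mean <- mean(per_id)

print(row_mean)
print(grand_mean)
```

The two results differ only because id 3 has two rows while the other ids have three; with equal group sizes they would coincide.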

Create new column based on condition from other column per group using tidy evaluation

Similar to this question but I want to use tidy evaluation instead.
df = data.frame(group = c(1,1,1,2,2,2,3,3,3),
date = c(1,2,3,4,5,6,7,8,9),
speed = c(3,4,3,4,5,6,6,4,9))
> df
group date speed
1 1 1 3
2 1 2 4
3 1 3 3
4 2 4 4
5 2 5 5
6 2 6 6
7 3 7 6
8 3 8 4
9 3 9 9
The task is to create a new column (newValue) whose value, within each group, equals the value of the date column in the row where speed == 4. Example: group 1 has a newValue of 2 because date[speed==4] = 2.
group date speed newValue
1 1 1 3 2
2 1 2 4 2
3 1 3 3 2
4 2 4 4 4
5 2 5 5 4
6 2 6 6 4
7 3 7 6 8
8 3 8 4 8
9 3 9 9 8
It worked without tidy evaluation
df %>%
group_by(group) %>%
mutate(newValue=date[speed==4L])
#> # A tibble: 9 x 4
#> # Groups: group [3]
#> group date speed newValue
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 3 2
#> 2 1 2 4 2
#> 3 1 3 3 2
#> 4 2 4 4 4
#> 5 2 5 5 4
#> 6 2 6 6 4
#> 7 3 7 6 8
#> 8 3 8 4 8
#> 9 3 9 9 8
But I got an error with tidy evaluation
my_fu <- function(df, filter_var){
filter_var <- sym(filter_var)
df <- df %>%
group_by(group) %>%
mutate(newValue=!!filter_var[speed==4L])
}
my_fu(df, "date")
#> Error in quos(..., .named = TRUE): object 'speed' not found
Thanks in advance.
We can wrap the unquoting in parentheses. Otherwise, !! is applied to the whole expression (filter_var[speed == 4L]) instead of to filter_var alone, because [ binds more tightly than !
library(rlang)
library(dplyr)
my_fu <- function(df, filter_var){
filter_var <- sym(filter_var)
df %>%
group_by(group) %>%
mutate(newValue=(!!filter_var)[speed==4L])
}
my_fu(df, "date")
# A tibble: 9 x 4
# Groups: group [3]
# group date speed newValue
# <dbl> <dbl> <dbl> <dbl>
#1 1 1 3 2
#2 1 2 4 2
#3 1 3 3 2
#4 2 4 4 4
#5 2 5 5 4
#6 2 6 6 4
#7 3 7 6 8
#8 3 8 4 8
#9 3 9 9 8
You can also use sqldf. Join df with a subquery restricted to speed = 4:
library(sqldf)
df = data.frame(group = c(1,1,1,2,2,2,3,3,3),
date = c(1,2,3,4,5,6,7,8,9),
speed = c(3,4,3,4,5,6,6,4,9))
sqldf("SELECT df_origin.*, df4.`date` new_value FROM
df df_origin join (SELECT `group`, `date` FROM df WHERE speed = 4) df4
on (df_origin.`group` = df4.`group`)")
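When the column name arrives as a string, another option is the .data pronoun from dplyr, which avoids sym() and !! altogether. The sketch below is a variation on my_fu from the question, not the original author's code. It assumes every group has exactly one row with speed == 4, since otherwise mutate() cannot recycle the result to the group size.

```r
library(dplyr)

df <- data.frame(
  group = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  date  = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  speed = c(3, 4, 3, 4, 5, 6, 6, 4, 9)
)

# filter_var is a column name given as a character string
my_fu <- function(df, filter_var) {
  df %>%
    group_by(group) %>%
    # .data[[filter_var]] looks the column up by name within the current group
    mutate(newValue = .data[[filter_var]][speed == 4L]) %>%
    ungroup()
}

res <- my_fu(df, "date")
print(res)
```

Because .data[[filter_var]] is an ordinary subsetting call, there is no operator-precedence issue when it is followed by [speed == 4L].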

fill gap in dataframe [duplicate]

This question already has answers here:
adding default values to item x group pairs that don't have a value (df %>% spread %>% gather seems strange)
(2 answers)
Closed 4 years ago.
Original Data
id hhcode value
1 1 4.1
1 2 4.5
1 3 3.3
10 5 3.2
Required Output
id hhcode value
1 1 4.1
1 2 4.5
1 3 3.3
1 5 0
10 1 0
10 2 0
10 3 0
10 5 3.2
What I have so far
df <- data.frame(
id = c(1, 1, 1, 10),
hhcode = c(1, 2, 3, 5),
value = c(4.1, 4.5, 3.3, 3.2)
)
library(statar)
library(tidyverse)
df %>%
group_by(id) %>%
fill_gap(hhcode, full = TRUE)
# A tibble: 10 x 3
# Groups: id [2]
id hhcode value
<dbl> <dbl> <dbl>
1 1 1 4.1
2 1 2 4.5
3 1 3 3.3
4 1 4 NA
5 1 5 NA
6 10 1 NA
7 10 2 NA
8 10 3 NA
9 10 4 NA
10 10 5 3.2
Any hint to get the required output?
We could use complete
library(tidyverse)
complete(df, id, hhcode, fill = list(value = 0))
# A tibble: 8 x 3
# id hhcode value
# <dbl> <dbl> <dbl>
#1 1 1 4.1
#2 1 2 4.5
#3 1 3 3.3
#4 1 5 0
#5 10 1 0
#6 10 2 0
#7 10 3 0
#8 10 5 3.2
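The same completion can be sketched in base R, which may help to show what complete() does conceptually: build every id/hhcode combination that occurs in the data, left-join the observed values onto it, and replace the missing values with 0. The names grid and full are illustrative only.

```r
df <- data.frame(
  id     = c(1, 1, 1, 10),
  hhcode = c(1, 2, 3, 5),
  value  = c(4.1, 4.5, 3.3, 3.2)
)

# All combinations of the id and hhcode values that actually occur
grid <- expand.grid(id = unique(df$id), hhcode = unique(df$hhcode))

# Left join: keep every combination, attach value where it exists
full <- merge(grid, df, by = c("id", "hhcode"), all.x = TRUE)

# Combinations without an observation get the default value 0
full$value[is.na(full$value)] <- 0

# Sort for readability
full <- full[order(full$id, full$hhcode), ]
print(full)
```

Only hhcode values present in the data are used, so hhcode 4 is not invented; that matches the required output and differs from the fill_gap attempt.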
