R: dplyr and row_number() does not enumerate as expected - r

I want to enumerate each record of a data frame/tibble that results from a grouping. The index should follow a defined order. If I use row_number() it does enumerate, but within each group. I want it to enumerate without considering the former grouping.
Here is an example. To make it simple I used the most minimal dataframe:
library(dplyr)
df0 <- data.frame( x1 = rep(LETTERS[1:2],each=2)
, x2 = rep(letters[1:2], 2)
, y = floor(abs(rnorm(4)*10))
)
df0
# x1 x2 y
# 1 A a 12
# 2 A b 24
# 3 B a 0
# 4 B b 12
Now, I group this table:
df1 <- df0 %>% group_by(x1,x2) %>% summarize(y=sum(y))
This gives me an object of class tibble:
# A tibble: 4 x 3
# Groups: x1 [?]
# x1 x2 y
# <fct> <fct> <dbl>
# 1 A a 12
# 2 A b 24
# 3 B a 0
# 4 B b 12
I want to add a row number to this table using row_number():
df2 <- df1 %>% arrange(desc(y)) %>% mutate(index = row_number())
df2
# A tibble: 4 x 4
# Groups: x1 [2]
# x1 x2 y index
# <fct> <fct> <dbl> <int>
# 1 A b 24 1
# 2 A a 12 2
# 3 B b 12 1
# 4 B a 0 2
row_number() enumerates within the former grouping. This was not my intention. It can be avoided by converting the tibble to a data frame first:
df2 <- df2 %>% as.data.frame() %>% arrange(desc(y)) %>% mutate(index = row_number())
df2
# x1 x2 y index
# 1 A b 24 1
# 2 A a 12 2
# 3 B b 12 3
# 4 B a 0 4
My question is: is this behaviour intended?
If yes: is it not very dangerous to incorporate former data processing into tibble? Which type of processing is incorporated?
At the moment I will convert tibble into dataframe to avoid this kind of unexpected results.

To elaborate on my comment: yes, retaining grouping is intended, and in many cases useful. It's only dangerous if you don't understand how group_by works—and that's true of any function. To undo group_by, you call ungroup.
Take a look at the group_by docs, as they're very thorough and explain how this function interacts with others, how grouping is layered, etc. The docs also explain how each call to summarise removes a layer of grouping—it might be there that you got confused about what's going on.
For example, you can group by x1 and x2, summarize y, and create a row number, which will give you the rows according to x1 (summarise removed a layer of grouping, i.e. drops the x2 grouping). Then ungrouping allows you to get row numbers based on the entire data frame.
library(dplyr)
df0 %>%
group_by(x1, x2) %>%
summarise(y = sum(y)) %>%
mutate(group_row = row_number()) %>%
ungroup() %>%
mutate(all_df_row = row_number())
#> # A tibble: 4 x 5
#> x1 x2 y group_row all_df_row
#> <fct> <fct> <dbl> <int> <int>
#> 1 A a 12 1 1
#> 2 A b 2 2 2
#> 3 B a 10 1 3
#> 4 B b 23 2 4
A use case (I do this for work probably every day) is to get sums within multiple groups (again, x1 and x2), then to find the shares of those values within their larger group (after peeling away a layer of grouping, this is x1) with mutate. Again, here I ungroup to show the shares within the entire data frame instead.
df0 %>%
group_by(x1, x2) %>%
summarise(y = sum(y)) %>%
mutate(share_in_group = y / sum(y)) %>%
ungroup() %>%
mutate(share_all_df = y / sum(y))
#> # A tibble: 4 x 5
#> x1 x2 y share_in_group share_all_df
#> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 A a 12 0.857 0.255
#> 2 A b 2 0.143 0.0426
#> 3 B a 10 0.303 0.213
#> 4 B b 23 0.697 0.489
Created on 2018-10-11 by the reprex package (v0.2.1)
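If it helps to see which grouping is still attached at each step, group_vars() reports it. This is only a minimal sketch: df_demo is a made-up fixed version of the question's data (no random numbers), and the object names are just for illustration.

```r
library(dplyr)

# Hypothetical fixed data so the result does not depend on rnorm()
df_demo <- data.frame(
  x1 = rep(c("A", "B"), each = 2),
  x2 = rep(c("a", "b"), 2),
  y  = c(12, 24, 0, 12)
)

grouped    <- df_demo %>% group_by(x1, x2)
summarised <- grouped %>% summarise(y = sum(y))  # default peels off the last layer (x2)
ungrouped  <- summarised %>% ungroup()

group_vars(grouped)     # "x1" "x2"
group_vars(summarised)  # "x1"
group_vars(ungrouped)   # character(0)
```

Checking group_vars() before a mutate() with row_number() is a quick way to confirm whether the numbering will restart per group.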

As camille nicely showed, there are good reasons for wanting to have the result of summarize() retain additional layers of grouping and it's a documented behaviour so not really dangerous or unexpected per se.
However one additional tip is that if you are just going to call ungroup() after summarize() you might as well use summarize(.groups = "drop") which will return an ungrouped tibble and save you a line of code.
library(tidyverse)
df0 <- data.frame(
x1 = rep(LETTERS[1:2], each = 2),
x2 = rep(letters[1:2], 2),
y = floor(abs(rnorm(4) * 10))
)
df0 %>%
group_by(x1,x2) %>%
summarize(y=sum(y), .groups = "drop") %>%
arrange(desc(y)) %>%
mutate(index = row_number())
#> # A tibble: 4 x 4
#> x1 x2 y index
#> <chr> <chr> <dbl> <int>
#> 1 A b 8 1
#> 2 A a 2 2
#> 3 B a 2 3
#> 4 B b 1 4
Created on 2022-02-06 by the reprex package (v2.0.1)

Related

How to filter out groups empty for 1 column in Tidyverse

tibble(
A = c("A","A","B","B"),
x = c(NA,NA,NA,1),
y = c(1,2,3,4),
) %>% group_by(A) -> df
desired output:
tibble(
A = c("B","B"),
x = c(NA,1),
y = c(3,4),
)
I want to find all groups for which every element of x (and only x) is NA, then remove those groups. "B" is kept because it has at least one non-NA element.
I tried:
df %>%
filter(all(!is.na(x)))
but it seems that filters out if it finds at least 1 NA; I need the correct word, which is not all.
This will remove groups of column A if all elements of x are NA:
library(dplyr)
df %>%
group_by(A) %>%
filter(! all(is.na(x)))
# A tibble: 2 × 3
# Groups: A [1]
# A x y
# <chr> <dbl> <dbl>
#1 B NA 3
#2 B 1 4
Note that group "A" was removed because both cells in the column x are not defined.
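To see why the placement of the negation matters, here is a small sketch on plain vectors. x_A and x_B are hypothetical names standing in for the x values of groups "A" and "B".

```r
x_A <- c(NA, NA)  # group "A": every value is missing
x_B <- c(NA, 1)   # group "B": one value is present

all(!is.na(x_A))  # FALSE
all(!is.na(x_B))  # FALSE: fails as soon as a single NA is present
!all(is.na(x_A))  # FALSE: the group is entirely NA, so it is dropped
!all(is.na(x_B))  # TRUE:  at least one non-NA value, so it is kept
any(!is.na(x_B))  # TRUE:  same condition as !all(is.na(x_B))
```

So all(!is.na(x)) means "no NA at all", while !all(is.na(x)) means "not entirely NA", which is the condition asked for.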
We can use any with complete.cases
library(dplyr)
df %>%
group_by(A) %>%
filter(any(complete.cases(x))) %>%
ungroup
Output:
# A tibble: 2 × 3
A x y
<chr> <dbl> <dbl>
1 B NA 3
2 B 1 4
In dplyr 1.1.0 and later (the devel version when this was written), we can use .by in filter, so we don't need group_by/ungroup:
df %>%
filter(any(complete.cases(x)), .by = 'A')
# A tibble: 2 × 3
A x y
<chr> <dbl> <dbl>
1 B NA 3
2 B 1 4

Spread data with non-unique keys with R

I have the following data frame:
ID Group
1 A
1 B
2 C
2 D
And I want to reshape the data frame into a wider version in terms of ID. Thus, the new data frame looks like this:
ID Group1 Group2
1 A B
2 C D
You can do this by adding a helper column and then using tidyr::pivot_wider():
library(dplyr)
library(tidyr)
data <- tibble(
id = c(1, 1, 2, 2),
group = letters[1:4]
)
# Add a helper column to use when pivoting. This uses the row number
# over each subgroup, i.e. over each value of `id`
transformed_data <- data %>%
group_by(id) %>%
mutate(helper = paste0("Group", row_number())) %>%
ungroup()
# Here's what the helper column looks like
transformed_data
#> # A tibble: 4 x 3
#> id group helper
#> <dbl> <chr> <chr>
#> 1 1 a Group1
#> 2 1 b Group2
#> 3 2 c Group1
#> 4 2 d Group2
# Pivot the data using the helper column
transformed_data %>%
pivot_wider(names_from = helper, values_from = group)
#> # A tibble: 2 x 3
#> id Group1 Group2
#> <dbl> <chr> <chr>
#> 1 1 a b
#> 2 2 c d

Calculating % of total within groups across each column and transposing

Is there a way to create the following output (assuming a lot of IDs and a lot more attributes)?
I am stuck after calculating the % of total by ATT1 within ID and then ATT2, etc.. Not sure how to go about making the rows into column headers and aggregate.
Input File (df in R):
ID ATT1 ATT2 ATT3 ATT4 Value
1 a x d i 10
1 a y d j 10
1 a y d k 10
1 b y c k 10
1 b y c l 10
2 a x c k 20
…
And I want the output file to look like (ATT4_l is cut off):
ID ATT1_a ATT1_b ATT2_x ATT2_y ATT3_d ATT3_c ATT4_i ATT4_j ATT4_k
1 0.6 0.4 0.2 0.8 0.6 0.4 0.2 0.2 0.4
...
I tried using dplyr
df %>% group_by(ID, ATT1) %>% mutate(proc = (Value/sum(Value) * 100))
But I am not sure what to do once I have all the ATT calculated to get them into columns and aggregated so that each ID only has 1 row of data.
You can do this with the two main workhorses of the tidyverse: dplyr for calculations and tidyr for reshaping data. Some of the reshaping is convoluted so I'm breaking it into steps.
library(dplyr)
library(tidyr)
...
If you gather the data from its original wide format into a long format, you'll have a column of IDs, a column of ATTx values, a column of letters (don't know the context meaning of these, so I'm literally calling it letters), and a column of values. From this format, you can group observations by combinations of ID, ATT, and letter, and you can later stick ATTs and letters together in the way you've laid out.
df %>%
gather(key = att, value = letter, -ID, -Value) %>%
head()
#> # A tibble: 6 x 4
#> ID Value att letter
#> <int> <int> <chr> <chr>
#> 1 1 10 ATT1 a
#> 2 1 10 ATT1 a
#> 3 1 10 ATT1 a
#> 4 1 10 ATT1 b
#> 5 1 10 ATT1 b
#> 6 2 20 ATT1 a
After grouping, calculate total values for each ID/ATT/letter combo:
df %>%
gather(key = att, value = letter, -ID, -Value) %>%
group_by(ID, att, letter) %>%
summarise(group_val = sum(Value)) %>%
head()
#> # A tibble: 6 x 4
#> # Groups: ID, att [3]
#> ID att letter group_val
#> <int> <chr> <chr> <int>
#> 1 1 ATT1 a 30
#> 2 1 ATT1 b 20
#> 3 1 ATT2 x 10
#> 4 1 ATT2 y 40
#> 5 1 ATT3 c 20
#> 6 1 ATT3 d 30
Using mutate, you can calculate the share of each observation within its larger group. The preceding summarise dropped one layer of the grouping hierarchy (letter), so the data is still grouped by ID and ATT, and this is the share of values for each letter within a given ID and ATT. Since you no longer need the total values, just their shares, drop that column, and stick the ATTs and letters back together with unite.
df %>%
gather(key = att, value = letter, -ID, -Value) %>%
group_by(ID, att, letter) %>%
summarise(group_val = sum(Value)) %>%
mutate(share = group_val / sum(group_val)) %>%
select(-group_val) %>%
unite(group, att, letter, sep = "_") %>%
head()
#> # A tibble: 6 x 3
#> # Groups: ID [1]
#> ID group share
#> <int> <chr> <dbl>
#> 1 1 ATT1_a 0.6
#> 2 1 ATT1_b 0.4
#> 3 1 ATT2_x 0.2
#> 4 1 ATT2_y 0.8
#> 5 1 ATT3_c 0.4
#> 6 1 ATT3_d 0.6
Now you have all the information you're looking for, just need to get it into a wide format, turning the values in the group column into individual columns. You do this with spread:
df %>%
gather(key = att, value = letter, -ID, -Value) %>%
group_by(ID, att, letter) %>%
summarise(group_val = sum(Value)) %>%
mutate(share = group_val / sum(group_val)) %>%
select(-group_val) %>%
unite(group, att, letter, sep = "_") %>%
spread(key = group, value = share)
#> # A tibble: 2 x 11
#> # Groups: ID [2]
#> ID ATT1_a ATT1_b ATT2_x ATT2_y ATT3_c ATT3_d ATT4_i ATT4_j ATT4_k
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.6 0.4 0.2 0.8 0.4 0.6 0.2 0.2 0.4
#> 2 2 1 NA 1 NA 1 NA NA NA 1
#> # ... with 1 more variable: ATT4_l <dbl>
Note that there are NAs filled in here where there aren't observations for combinations of ID/ATT/letter. I'm assuming you'll have more complete data than in the sample you posted.
Created on 2018-10-03 by the reprex package (v0.2.1)
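For newer versions of tidyr, where pivot_longer() and pivot_wider() are the recommended replacements for gather() and spread(), the same pipeline might look roughly like the sketch below. It assumes only the six rows shown in the question (the elided rows are not reconstructed), and df_demo and shares are names made up for this example.

```r
library(dplyr)
library(tidyr)

# Only the six rows visible in the question; anything elided there is left out
df_demo <- tibble(
  ID    = c(1, 1, 1, 1, 1, 2),
  ATT1  = c("a", "a", "a", "b", "b", "a"),
  ATT2  = c("x", "y", "y", "y", "y", "x"),
  ATT3  = c("d", "d", "d", "c", "c", "c"),
  ATT4  = c("i", "j", "k", "k", "l", "k"),
  Value = c(10, 10, 10, 10, 10, 20)
)

shares <- df_demo %>%
  pivot_longer(cols = starts_with("ATT"), names_to = "att", values_to = "letter") %>%
  group_by(ID, att, letter) %>%
  # "drop_last" makes the default explicit: the result stays grouped by ID and att
  summarise(group_val = sum(Value), .groups = "drop_last") %>%
  mutate(share = group_val / sum(group_val)) %>%
  ungroup() %>%
  select(-group_val) %>%
  unite("group", att, letter, sep = "_") %>%
  pivot_wider(names_from = group, values_from = share)

shares
```

Spelling out .groups = "drop_last" also silences the regrouping message and documents which grouping the mutate() step relies on.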
I believe you are looking for the reshape2 package
library(reshape2)
df.new <- dcast(df,
formula = ID~ATT1,
value.var = "proc",
fun.aggregate = mean)
This will not completely fix your problem though - I recommend doing this first to make your data tidy
df.tidy <- melt(df,
id.vars = c("ID","Value"),
variable.name = "ATT1_4",
value.name = "att.factor")
df.tidy <- df.tidy %>% group_by(ID, att.factor) %>% mutate(proc = (Value/sum(Value)*100))
df.new <- dcast(df.tidy,
formula = ID~att.factor,
value.var = "proc",
fun.aggregate = mean)
NaN will be returned for any combination that isn't represented in df.tidy. You can use the fill argument to assign a value to those.

dplyr sample_n by group with unique size argument per group

I am trying to draw a stratified sample from a data set for which a variable exists that indicates how large the sample size per group should be.
library(dplyr)
# example data
df <- data.frame(id = 1:15,
grp = rep(1:3,each = 5),
frq = rep(c(3,2,4), each = 5))
In this example, grp refers to the group I want to sample by and frq is the sample size specified for that group.
Using split, I came up with this possible solution, which gives the desired result but seems rather inefficient :
s <- split(df, df$grp)
lapply(s,function(x) sample_n(x, size = unique(x$frq))) %>%
do.call(what = rbind)
Is there a way using just dplyr's group_by and sample_n to do this?
My first thought was:
df %>% group_by(grp) %>% sample_n(size = frq)
but this gives the error:
Error in is_scalar_integerish(size) : object 'frq' not found
This works:
df %>% group_by(grp) %>% sample_n(frq[1])
# A tibble: 9 x 3
# Groups: grp [3]
id grp frq
<int> <int> <dbl>
1 3 1 3
2 4 1 3
3 2 1 3
4 6 2 2
5 8 2 2
6 13 3 4
7 14 3 4
8 12 3 4
9 11 3 4
Not sure why it didn't work when you tried it.
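As a possible variation, since sample_n() is superseded by slice_sample() in newer dplyr releases, one could apply slice_sample() per group through group_modify(). This is a sketch rather than a drop-in recommendation; the seed is arbitrary and sampled is a name invented here.

```r
library(dplyr)

# Example data from the question
df <- data.frame(id  = 1:15,
                 grp = rep(1:3, each = 5),
                 frq = rep(c(3, 2, 4), each = 5))

set.seed(1)  # arbitrary seed, only to make the draw repeatable

sampled <- df %>%
  group_by(grp) %>%
  # .x is the data of one group (without grp); its first frq value is the sample size
  group_modify(~ slice_sample(.x, n = .x$frq[1])) %>%
  ungroup()

count(sampled, grp)  # one row per group with the number of sampled rows
```

Reading the size from .x inside the function avoids relying on how the size argument itself is evaluated.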
library(tidyverse)
# example data
df <- data.frame(id = 1:15,
grp = rep(1:3,each = 5),
frq = rep(c(3,2,4), each = 5))
set.seed(22)
df %>%
group_by(grp) %>% # for each group
nest() %>% # nest data
mutate(v = map(data, ~sample_n(data.frame(id=.$id), unique(.$frq)))) %>% # sample using id values and (unique) frq value
unnest(v) # unnest the sampled values
# # A tibble: 9 x 2
# grp id
# <int> <int>
# 1 1 2
# 2 1 5
# 3 1 3
# 4 2 8
# 5 2 9
# 6 3 14
# 7 3 13
# 8 3 15
# 9 3 11
Function sample_n works if you pass as inputs a data frame of ids (not a vector of ids) and one frequency value (for each group).
An alternative version using map2 and generating the inputs for sample_n in advance:
df %>%
group_by(grp) %>% # for every group
summarise(d = list(data.frame(id=id)), # create a data frame of ids
frq = unique(frq)) %>% # get the unique frq value
mutate(v = map2(d, frq, ~sample_n(.x, .y))) %>% # sample using data frame of ids and frq value
unnest(v) %>% # unnest sampled values
select(-frq) # remove frq column (if needed)
The following answer is not recommended, just shows a different approach without nests/maps that some people might find more comprehensible. Possibly of use to someone working with a smallish data set who wants to do something slightly different to the original question, is a bit scared or doesn't have time to play around with functions they don't really understand, and isn't too worried about efficiency. You just need to recall the behaviour of the original sample function in base R: when provided with a (positive) integer argument x, it outputs a vector randomly permuting the integers from 1:x.
> sample(5)
[1] 5 1 4 2 3
If we had five elements, we could then obtain a random sample of size three by only selecting the positions where 1, 2 and 3 were permuted - in this case we'd pick the second, fourth and fifth elements. All clear? Then similarly we can just do that within each group, assigning random integers from 1 to the group size, and choosing as our sample the places where the random id is less than or equal to the desired sample size for that group.
library(tidyverse)
# The iris data set has three different species
# I want to sample 2, 5 and 3 flowers respectively from each
sample_sizes <- data.frame(
Species = unique(iris$Species),
n_to_sample = c(2, 5, 3)
)
iris %>%
left_join(sample_sizes, by = "Species") %>% # adds column for how many to sample from this species
group_by(Species) %>% # each species is a group, the size of the group can be found by n()
mutate(random_id = sample(n())) %>% # give each flower in the group a random id between 1 and n()
ungroup() %>%
filter(random_id <= n_to_sample)
Which gave me the output:
# A tibble: 10 x 7
Sepal.Length Sepal.Width Petal.Length Petal.Width Species n_to_sample random_id
<dbl> <dbl> <dbl> <dbl> <fct> <dbl> <int>
1 4.9 3.1 1.5 0.1 setosa 2 1
2 5.7 4.4 1.5 0.4 setosa 2 2
3 6.2 2.2 4.5 1.5 versicolor 5 3
4 6.3 2.5 4.9 1.5 versicolor 5 2
5 6.4 2.9 4.3 1.3 versicolor 5 5
6 6 2.9 4.5 1.5 versicolor 5 4
7 5.5 2.4 3.8 1.1 versicolor 5 1
8 7.3 2.9 6.3 1.8 virginica 3 1
9 7.2 3 5.8 1.6 virginica 3 3
10 6.2 3.4 5.4 2.3 virginica 3 2
You can of course pipe through to select(-random_id, -n_to_sample) if you no longer have any use for the final two columns, but I left them in so it's clearer from the output how the code worked.
For the example data given in the question:
library(dplyr)
# example data
df <- data.frame(id = 1:15,
grp = rep(1:3,each = 5),
frq = rep(c(3,2,4), each = 5))
df %>%
group_by(grp) %>%
mutate(random_id = sample(n())) %>%
ungroup() %>%
filter(random_id <= frq) %>%
select(-random_id)
# A tibble: 9 x 3
id grp frq
<int> <int> <dbl>
1 1 1 3
2 2 1 3
3 3 1 3
4 8 2 2
5 9 2 2
6 11 3 4
7 12 3 4
8 13 3 4
9 15 3 4
NB if you're a safety fanatic and x might be zero, and you want to guarantee the length of the output is definitely the same as x, you're better to do sample(seq_len(x)) than sample(x). That way you get the zero-length vector integer(0) rather than the length-one vector 0 in the case where x is zero. In my code, the mutate will never be working on a row for which n() is zero (if n() were zero then that group is empty so there couldn't be a row there) and this isn't a problem. Just something to be aware of if you're taking this approach somewhere else.
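The zero case mentioned above can be checked directly; x here is just an illustrative variable.

```r
x <- 0

length(sample(seq_len(x)))  # 0: seq_len(0) is integer(0), so nothing is sampled
length(sample(x))           # 1: sample(0) treats 0 as a one-element vector
```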
Benchmarks for comparison:
f1 <- function(df) { # @AntoniosK with nest and map
df %>%
group_by(grp) %>% # for each group
nest() %>% # nest data
mutate(v = map(data, ~sample_n(data.frame(id=.$id), unique(.$frq)))) %>% # sample using id values and (unique) frq value
unnest(v) # unnest the sampled values
}
f2 <- function(df) { # @AntoniosK with nest and map2
df %>%
group_by(grp) %>% # for every group
summarise(d = list(data.frame(id=id)), # create a data frame of ids
frq = unique(frq)) %>% # get the unique frq value
mutate(v = map2(d, frq, ~sample_n(.x, .y))) %>% # sample using data frame of ids and frq value
unnest(v) %>% # unnest sampled values
select(-frq) # remove frq column (if needed)
}
f3 <- function(df) { # @thc
df %>% group_by(grp) %>% sample_n(frq[1])
}
f4 <- function(df) { # @Silverfish
df %>%
group_by(grp) %>%
mutate(random_id = sample(n())) %>%
ungroup() %>%
filter(random_id <= frq) %>%
select(-random_id)
}
# example data of variable size
df_n <- function(n) {
data.frame(id = seq_len(3*n),
grp = rep(1:3,each = n),
frq = rep(c(3,2,4), each = n))
}
require(microbenchmark)
microbenchmark(f1(df_n(1e3)), f2(df_n(1e3)), f3(df_n(1e3)), f4(df_n(1e3)),
f1(df_n(1e6)), f2(df_n(1e6)), f3(df_n(1e6)), f4(df_n(1e6)),
times=20)
Results strongly favour @thc's df %>% group_by(grp) %>% sample_n(frq[1]), both for data frames with a few thousand rows and for those with a few million. My naive approach takes two or three times as long, and @AntoniosK's faster solution is the one with nest and map2 (worse than mine for smaller data frames but better for the larger ones).
Unit: milliseconds
expr min lq mean median uq max neval
f1(df_n(1000)) 12.0007 12.27295 12.479760 12.34190 12.46475 13.6403 20
f2(df_n(1000)) 9.5841 9.82185 9.905120 9.87820 9.98865 10.2993 20
f3(df_n(1000)) 1.3729 1.53470 1.593015 1.56755 1.68910 1.8456 20
f4(df_n(1000)) 3.1732 3.21600 3.558855 3.27500 3.57350 5.4715 20
f1(df_n(1e+06)) 1582.3807 1695.15655 1699.288195 1714.13435 1727.53300 1744.2654 20
f2(df_n(1e+06)) 323.3649 336.94280 407.581130 346.95390 463.69935 911.6647 20
f3(df_n(1e+06)) 216.3265 235.85830 268.756465 247.63620 259.02640 395.9372 20
f4(df_n(1e+06)) 641.5119 663.03510 737.089355 682.69730 803.98205 1132.6586 20

dplyr: passing a grouped tibble to a custom function

(The following scenario simplifies my actual situation)
My data comes from villages, and I would like to summarize an outcome variable by a village variable.
> data
village A Z Y
<chr> <int> <int> <dbl>
1 a 1 1 500
2 a 1 1 400
3 a 1 0 800
4 b 1 0 300
5 b 1 1 700
For example, I would like to calculate the mean of Y only using Z==z by villages. In this case, I want to have (500 + 400)/2 = 450 for village "a" and 700 for village "b".
Please note that the actual situation is more complicated and I cannot directly use this answer, but the point is I need to pass a grouped tibble and a global variable (z) to my function.
z <- 1 # z takes 0 or 1
data %>%
group_by(village) %>% # grouping by village
summarize(Y_village = Y_hat_village(., z)) # pass a part of tibble and a global variable
Y_hat_village <- function(data_village, z){
# This function takes a part of tibble (`data_village`) and a variable `z`
# Calculate the mean for a specific z in a village
data_z <- data_village %>% filter(Z==get("z"))
return(mean(data_z$Y))
}
However, I found that . passes the entire tibble, so the code above returns the same value for all groups.
There are a couple things you can simplify. One is in your function: since you're passing in a value z to the function, you don't need to use get("z"). You have a z in the global environment that you pass in; or, more safely, assign your z value to a variable with some other name so you don't run into scoping issues, and pass that in to the function. In this case, I'm calling it z_val.
library(tidyverse)
z_val <- 1
Y_hat_village2 <- function(data, z) {
data_z <- data %>% filter(Z == z)
return(mean(data_z$Y))
}
You can make the function call on each group using do, which will get you a list-column, and then unnesting that column. Again note that I'm passing in the variable z_val to the argument z.
df %>%
group_by(village) %>%
do(y_hat = Y_hat_village2(., z = z_val)) %>%
unnest()
#> # A tibble: 2 x 2
#> village y_hat
#> <chr> <dbl>
#> 1 a 450
#> 2 b 700
However, do is being deprecated in favor of purrr::map, which I am still having trouble getting the hang of. In this case, you can group and nest, which gives a column of data frames called data, then map over that column and again supply z = z_val. When you unnest the y_hat column, you still have the original data as a nested column, since you wanted access to the rest of the columns still.
df %>%
group_by(village) %>%
nest() %>%
mutate(y_hat = map(data, ~Y_hat_village2(., z = z_val))) %>%
unnest(y_hat)
#> # A tibble: 2 x 3
#> village data y_hat
#> <chr> <list> <dbl>
#> 1 a <tibble [3 × 3]> 450
#> 2 b <tibble [2 × 3]> 700
Just to check that everything works okay, I also passed in z = 0 to check for 1. scoping issues, and 2. that other values of z work.
df %>%
group_by(village) %>%
nest() %>%
mutate(y_hat = map(data, ~Y_hat_village2(., z = 0))) %>%
unnest(y_hat)
#> # A tibble: 2 x 3
#> village data y_hat
#> <chr> <list> <dbl>
#> 1 a <tibble [3 × 3]> 800
#> 2 b <tibble [2 × 3]> 300
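Another option that hands each group's data to the function is dplyr's group_modify(). The following is a rough sketch under the same setup; the sample data is retyped from the question, and res is a name used only here.

```r
library(dplyr)

# Sample data as printed in the question
df <- data.frame(village = c("a", "a", "a", "b", "b"),
                 A = 1,
                 Z = c(1, 1, 0, 0, 1),
                 Y = c(500, 400, 800, 300, 700))

z_val <- 1

Y_hat_village2 <- function(data, z) {
  data_z <- data %>% filter(Z == z)
  mean(data_z$Y)
}

res <- df %>%
  group_by(village) %>%
  # .x is the tibble for one village; the function must return a data frame
  group_modify(~ tibble(y_hat = Y_hat_village2(.x, z = z_val))) %>%
  ungroup()

res
```

Unlike . inside summarise(), .x here refers to the rows of the current group only, so each village gets its own value.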
As an extension/modification to @patL's answer, you can also wrap the tidyverse solution within purrr::map to return a list of two tibbles, one for each z value:
z <- c(0, 1);
map(z, ~df %>% filter(Z == .x) %>% group_by(village) %>% summarise(Y.mean = mean(Y)))
#[[1]]
## A tibble: 2 x 2
# village Y.mean
# <fct> <dbl>
#1 a 800.
#2 b 300.
#
#[[2]]
## A tibble: 2 x 2
# village Y.mean
# <fct> <dbl>
#1 a 450.
#2 b 700.
Sample data
df <- read.table(text =
" village A Z Y
1 a 1 1 500
2 a 1 1 400
3 a 1 0 800
4 b 1 0 300
5 b 1 1 700 ", header = T)
You can use dplyr to accomplish it:
library(dplyr)
df %>%
group_by(village) %>%
filter(Z == 1) %>%
summarise(Y_village = mean(Y))
## A tibble: 2 x 2
# village Y_village
# <chr> <dbl>
#1 a 450
#2 b 700
To get all columns:
df %>%
group_by(village) %>%
filter(Z == 1) %>%
mutate(Y_village = mean(Y)) %>%
distinct(village, A, Z, Y_village)
## A tibble: 2 x 4
## Groups: village [2]
# village A Z Y_village
# <chr> <dbl> <dbl> <dbl>
#1 a 1 1 450
#2 b 1 1 700
data
df <- tibble(village = c("a", "a", "a", "b", "b"),
A = rep(1, 5),
Z = c(1, 1, 0, 0, 1),
Y = c(500, 400, 800, 300, 700))

Resources