I have objects containing monthly data on plant growth. Each object has a fixed number of columns, and the number of rows equals the number of months the plant survived. I would like to take the mean of these objects so that the mean considers only the plants surviving at a given timestep. Here is example data:
df1 <- data.frame(GPP = 1:10, NPP = 1:10)
df2 <- data.frame(GPP = 2:8, NPP = 2:8)
df3 <- data.frame(GPP = 3:9, NPP = 3:9 )
In this scenario, the maximum number of timesteps is 10, and the 2nd and 3rd plants did not survive that long. To take the mean, my initial thought was to pad the missing rows with NA so the dimensions match, like this:
na <- matrix(NA, nrow = 3, ncol = 2)
colnames(na) <- c("GPP","NPP")
df2 <- rbind(df2, na)
df3 <- rbind(df3, na)
This is not desirable because the NA is not simply ignored as I had hoped; any arithmetic involving NA returns NA, as shown here:
(df1 + df2 + df3) / 3
GPP NPP
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6
6 7 7
7 8 8
8 NA NA
9 NA NA
10 NA NA
I can NOT just fill the padding with 0s, because I want the mean of every plant that is alive at a given timestep while completely ignoring those that have died; filling with 0s would skew the mean. For the example data here, this is the desired outcome:
(df1 + df2 + df3) / 3
GPP NPP
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6
6 7 7
7 8 8
8 8 8
9 9 9
10 10 10
Here rows 8-10 take the values from df1 alone, because df2 and df3 each have only 7 rows.
I'll restate my comment: it is generally much safer to encode the month in the original data before you do anything else; it is explicit and will insulate you from mistakes later in the pipeline that might inadvertently change the order of rows (which completely breaks any valid significance you hope to attain). Additionally, since I'm going to recommend putting all data into one frame, let's encode the plant number as well (even if we don't use it immediately here).
For that, then:
df1 <- data.frame(plant = "A", month = 1:10, GPP = 1:10, NPP = 1:10)
df2 <- data.frame(plant = "B", month = 1:7, GPP = 2:8, NPP = 2:8)
df3 <- data.frame(plant = "C", month = 1:7, GPP = 3:9, NPP = 3:9)
From this, I'm a huge fan of having all data in one frame. This is well-informed by https://stackoverflow.com/a/24376207/3358227, where a premise is that if you're going to do the same thing to a bunch of frames, it should either be a list-of-frames or one combined frame (that keeps the source id encoded):
dfs <- do.call(rbind, list(df1, df2, df3))
### just a sampling, for depiction
dfs[c(1:2, 10:12, 17:19),]
# plant month GPP NPP
# 1 A 1 1 1
# 2 A 2 2 2
# 10 A 10 10 10
# 11 B 1 2 2
# 12 B 2 3 3
# 17 B 7 8 8
# 18 C 1 3 3
# 19 C 2 4 4
base R
aggregate(cbind(GPP, NPP) ~ month, data = dfs, FUN = mean, na.rm = TRUE)
# month GPP NPP
# 1 1 2 2
# 2 2 3 3
# 3 3 4 4
# 4 4 5 5
# 5 5 6 6
# 6 6 7 7
# 7 7 8 8
# 8 8 8 8
# 9 9 9 9
# 10 10 10 10
dplyr
library(dplyr)
dfs %>%
  group_by(month) %>%
  summarize(across(c(GPP, NPP), mean))
# # A tibble: 10 x 3
# month GPP NPP
# <int> <dbl> <dbl>
# 1 1 2 2
# 2 2 3 3
# 3 3 4 4
# 4 4 5 5
# 5 5 6 6
# 6 6 7 7
# 7 7 8 8
# 8 8 8 8
# 9 9 9 9
# 10 10 10 10
Side point: two bits of information you are "losing" in this summary are the group size and the variability within each month. You might include them with:
dfs %>%
  group_by(month) %>%
  summarize(across(c(GPP, NPP), list(mu = ~ mean(.), sigma = ~ sd(.), len = ~ length(.))))
# # A tibble: 10 x 7
# month GPP_mu GPP_sigma GPP_len NPP_mu NPP_sigma NPP_len
# <int> <dbl> <dbl> <int> <dbl> <dbl> <int>
# 1 1 2 1 3 2 1 3
# 2 2 3 1 3 3 1 3
# 3 3 4 1 3 4 1 3
# 4 4 5 1 3 5 1 3
# 5 5 6 1 3 6 1 3
# 6 6 7 1 3 7 1 3
# 7 7 8 1 3 8 1 3
# 8 8 8 NA 1 8 NA 1
# 9 9 9 NA 1 9 NA 1
# 10 10 10 NA 1 10 NA 1
In this case, an average of 8 may be meaningful, but noting that it is based on a single surviving plant also tells you something about the "strength" of that statistic (i.e., weak).
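If you want to act on that, one option (just a sketch; the cutoff of two surviving plants is arbitrary) is to keep only the months where the statistic is supported by more than one plant, reusing dfs from above:
dfs %>%
  group_by(month) %>%
  summarize(across(c(GPP, NPP), list(mu = ~ mean(.), sigma = ~ sd(.), len = ~ length(.)))) %>%
  filter(GPP_len >= 2)   # hypothetical cutoff: at least two surviving plants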
library(dplyr)
df1 <- data.frame(month = 1:10, GPP = 1:10, NPP = 1:10)
df2 <- data.frame(month = 1:7, GPP = 2:8, NPP = 2:8)
df3 <- data.frame(month = 1:7, GPP = 3:9, NPP = 3:9 )
df <- rbind(df1, df2, df3)
df %>%
  group_by(month) %>%
  summarise(GPP = mean(GPP),
            NPP = mean(NPP))
month GPP NPP
<int> <dbl> <dbl>
1 1 2 2
2 2 3 3
3 3 4 4
4 4 5 5
5 5 6 6
6 6 7 7
7 7 8 8
8 8 8 8
9 9 9 9
10 10 10 10
Using data.table
library(data.table)
rbindlist(mget(ls(pattern = '^df\\d+$')))[, lapply(.SD, mean), month]
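If you also want to keep track of which plant each row came from (mirroring the plant column used in the other answer), rbindlist() has an idcol argument that picks up the names returned by mget(); a small sketch:
dt <- rbindlist(mget(ls(pattern = '^df\\d+$')), idcol = "plant")
dt[, lapply(.SD, mean), by = month, .SDcols = c("GPP", "NPP")]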
I have a dataframe with several list columns that I want to unnest (or unchop). BUT, they are different lengths, so the resulting error is Error: No common size for...
Here is a reprex to show what works and doesn't work.
library(tidyr)
library(vctrs)
# This works as expected
df_A <- tibble(
  ID = 1:3,
  A = as_list_of(list(c(9, 8, 5), c(7, 6), c(6, 9)))
)
unchop(df_A, cols = c(A))
# A tibble: 7 x 2
ID A
<int> <dbl>
1 1 9
2 1 8
3 1 5
4 2 7
5 2 6
6 3 6
7 3 9
# This works as expected as the lists are the same lengths
df_AB_1 <- tibble(
  ID = 1:3,
  A = as_list_of(list(c(9, 8, 5), c(7, 6), c(6, 9))),
  B = as_list_of(list(c(1, 2, 3), c(4, 5), c(7, 8)))
)
unchop(df_AB_1, cols = c(A, B))
# A tibble: 7 x 3
ID A B
<int> <dbl> <dbl>
1 1 9 1
2 1 8 2
3 1 5 3
4 2 7 4
5 2 6 5
6 3 6 7
7 3 9 8
# This does NOT work as the lists are different lengths
df_AB_2 <- tibble(
  ID = 1:3,
  A = as_list_of(list(c(9, 8, 5), c(7, 6), c(6, 9))),
  B = as_list_of(list(c(1, 2), c(4, 5, 6), c(7, 8, 9, 0)))
)
unchop(df_AB_2, cols = c(A, B))
# Error: No common size for `A`, size 3, and `B`, size 2.
The output that I would like to achieve for df_AB_2 above is as follows where each list is unchopped and missing values are filled with NA:
# A tibble: 10 x 3
ID A B
<dbl> <dbl> <dbl>
1 1 9 1
2 1 8 2
3 1 5 NA
4 2 7 4
5 2 6 5
6 2 NA 6
7 3 6 7
8 3 9 8
9 3 NA 9
10 3 NA 0
I have referenced this issue on Github and StackOverflow here.
Any ideas how to achieve the result above?
Versions
> packageVersion("tidyr")
[1] ‘1.0.0’
> packageVersion("vctrs")
[1] ‘0.2.0.9001’
Here is an idea via dplyr that you can generalise to as many columns as you want,
library(tidyverse)
df_AB_2 %>%
  pivot_longer(c(A, B)) %>%
  mutate(value = lapply(value, `length<-`, max(lengths(value)))) %>%
  pivot_wider(names_from = name, values_from = value) %>%
  unnest() %>%
  filter(rowSums(is.na(.[-1])) != 2)
which gives,
# A tibble: 10 x 3
ID A B
<int> <dbl> <dbl>
1 1 9 1
2 1 8 2
3 1 5 NA
4 2 7 4
5 2 6 5
6 2 NA 6
7 3 6 7
8 3 9 8
9 3 NA 9
10 3 NA 0
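Two small adjustments if you adapt this (treat them as a sketch rather than gospel): unnest() without an explicit cols argument warns in tidyr >= 1.0.0, and the hard-coded 2 in the filter is really the number of unnested columns, so with more list columns something like this is a little more general:
df_AB_2 %>%
  pivot_longer(c(A, B)) %>%
  mutate(value = lapply(value, `length<-`, max(lengths(value)))) %>%
  pivot_wider(names_from = name, values_from = value) %>%
  unnest(c(A, B)) %>%
  filter(rowSums(is.na(.[-1])) < ncol(.) - 1)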
Defining a helper function to update the lengths of the list elements and then proceeding with dplyr:
foo <- function(x, len_vec) {
  lapply(
    seq_len(length(x)),
    function(i) {
      length(x[[i]]) <- len_vec[i]
      x[[i]]
    }
  )
}
df_AB_2 %>%
  mutate(maxl = pmax(lengths(A), lengths(B))) %>%
  mutate(A = foo(A, maxl), B = foo(B, maxl)) %>%
  unchop(cols = c(A, B)) %>%
  select(-maxl)
# A tibble: 10 x 3
ID A B
<int> <dbl> <dbl>
1 1 9 1
2 1 8 2
3 1 5 NA
4 2 7 4
5 2 6 5
6 2 NA 6
7 3 6 7
8 3 9 8
9 3 NA 9
10 3 NA 0
Using data.table:
library(data.table)
setDT(df_AB_2)
df_AB_2[, maxl := pmax(lengths(A), lengths(B))]
df_AB_2[, .(unlist(A)[seq_len(maxl)], unlist(B)[seq_len(maxl)]), by = ID]
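Since the j expression is unnamed, data.table will call the resulting columns V1 and V2; if you want to keep the original names, you can name them in the list call:
df_AB_2[, .(A = unlist(A)[seq_len(maxl)], B = unlist(B)[seq_len(maxl)]), by = ID]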
I have a data frame as below:
df <- data.frame(
  id = c(1:5),
  a = c(3,10,4,0,15),
  b = c(2,1,1,0,3),
  c = c(12,3,0,3,1),
  d = c(9,7,8,0,0),
  e = c(1,2,0,2,2)
)
I need to add multiple columns whose names are a combination of a prefix (usa, canada, nz in this example) and a postfix (3:5). The postfix values 3:5 are also used inside the sum() function:
df %>% mutate(
  usa_3 = sum(1+3),
  usa_4 = sum(1+4),
  usa_5 = sum(1+5),
  canada_3 = sum(1+3),
  canada_4 = sum(1+4),
  canada_5 = sum(1+5),
  nz_3 = sum(1+3),
  nz_4 = sum(1+4),
  nz_5 = sum(1+5)
)
The result is really simple, but I do not want to repeat nearly identical code over and over.
id a b c d e usa_3 usa_4 usa_5 canada_3 canada_4 canada_5 nz_3 nz_4 nz_5
1 1 3 2 12 9 1 4 5 6 4 5 6 4 5 6
2 2 10 1 3 7 2 4 5 6 4 5 6 4 5 6
3 3 4 1 0 8 0 4 5 6 4 5 6 4 5 6
4 4 0 0 3 0 2 4 5 6 4 5 6 4 5 6
5 5 15 3 1 0 2 4 5 6 4 5 6 4 5 6
The new variable names are an alphabetical prefix plus a range of integers as a postfix.
The postfix is also used inside the sum() function, as 1 + postfix.
In this case there are 3 values of each, so the result has 9 additional columns.
I would prefer not to define a function outside this block of code, and I suppose the map functions in purrr may help.
Do you know how to make this work?
In particular, it is difficult to assign dynamic column names inside a pipe.
I found some similar questions, but they do not match my need:
Multivariate mutate
How to use map from purrr with dplyr::mutate to create multiple new columns based on column pairs
===== ADDITIONAL INFO =====
Let me clarify some conditions of this issue.
Actually, the sum(1+3), sum(1+4)... part is replaced by as.factor(cutree(X, k = Y)), where X is the result of a cluster analysis and Y is a variable corresponding to 3:5 in the example. cutree() is a function that defines where to cut the dendrogram stored in the result of a cluster analysis.
As for the column names usa_3, usa_4, ..., nz_5: the country name is replaced by the clustering method (Ward, McQuitty, median, etc.; seven methods in total), and the integers 3, 4, 5 are the parameter that defines where the dendrogram should be cut, as explained above.
As for the X in as.factor(cutree(X, k = Y)), the cluster analysis results are stored in several objects, one per method, so there is also the issue of how to apply the function to each of those objects.
The actual script I am currently using looks something like this:
cluste_number <- original_df %>% mutate(
  ## Ward
  ward_3 = as.factor(cutree(clst.ward, k = 3)),
  ward_4 = as.factor(cutree(clst.ward, k = 4)),
  ward_5 = as.factor(cutree(clst.ward, k = 5)),
  ward_6 = as.factor(cutree(clst.ward, k = 6)),
  ## Single
  sing_3 = as.factor(cutree(clst.sing, k = 3)),
  sing_4 = as.factor(cutree(clst.sing, k = 4)),
  sing_5 = as.factor(cutree(clst.sing, k = 5)),
  sing_6 = as.factor(cutree(clst.sing, k = 6)))
I am sorry for not clarifying the actual issue at first; however, for the reasons above, the number of prefixes (usa, canada, nz) and the number of postfix parameters do not necessarily match.
Also, suggestions based on i + . do not fit the issue, since the function as.factor(cutree(X, k = Y)) is what is used in the actual operation.
Thank you for your support.
Not sure what you are up to, but maybe this helps to clarify the issue:
library(tidyverse)
df <- data.frame(
  id = c(1:5),
  a = c(3,10,4,0,15),
  b = c(2,1,1,0,3),
  c = c(12,3,0,3,1),
  d = c(9,7,8,0,0),
  e = c(1,2,0,2,2)
)
ctry <- rep(c("usa", "ca", "nz"), each = 3)
nr <- rep(seq(3,5), times = 3)
df %>%
  as_tibble() %>%
  bind_cols(map_dfc(seq_along(ctry), ~ rep(1 + nr[.x], nrow(df))) %>%
              set_names(str_c(ctry, nr, sep = "_")))
# A tibble: 5 x 15
id a b c d e usa_3 usa_4 usa_5 ca_3 ca_4 ca_5 nz_3 nz_4 nz_5
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 3 2 12 9 1 4 5 6 4 5 6 4 5 6
2 2 10 1 3 7 2 4 5 6 4 5 6 4 5 6
3 3 4 1 0 8 0 4 5 6 4 5 6 4 5 6
4 4 0 0 3 0 2 4 5 6 4 5 6 4 5 6
5 5 15 3 1 0 2 4 5 6 4 5 6 4 5 6
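If the real goal is the cutree() case from the edit, the same map idea can loop over both the clustering objects and the k values; here is a rough sketch (assuming clst.ward, clst.sing, etc. are existing hclust results and original_df has one row per clustered observation) that builds the analogue of the cluste_number frame from the question:
library(dplyr)
library(purrr)
clusterings <- list(ward = clst.ward, sing = clst.sing)  # add the other methods here
ks <- 3:6
new_cols <- imap_dfc(clusterings, function(cl, method) {
  map_dfc(ks, function(k) {
    tibble(col = as.factor(cutree(cl, k = k))) %>%
      set_names(paste(method, k, sep = "_"))
  })
})
cluster_number <- bind_cols(original_df, new_cols)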
I'm not sure if I understand the spirit of the problem, but here is one way to generate a data frame with the column names and values you want.
You can change ~ function(i) i + . to be whatever function of i (the column being mutated) you want, and change either of the ns in setNames(n, n) to incorporate a different value into the function you're creating (first n) or change the names of the resulting columns (second n).
countries <- c('usa', 'canada', 'nz')
n <- 3:5
as.data.frame(matrix(1, nrow(df), length(n))) %>%
  rename_all(~ countries) %>%
  mutate_all(map(setNames(n, n), ~ function(i) i + .)) %>%
  select(-countries) %>%
  bind_cols(df)
# usa_3 canada_3 nz_3 usa_4 canada_4 nz_4 usa_5 canada_5 nz_5 id a b c d e
# 1 4 4 4 5 5 5 6 6 6 1 3 2 12 9 1
# 2 4 4 4 5 5 5 6 6 6 2 10 1 3 7 2
# 3 4 4 4 5 5 5 6 6 6 3 4 1 0 8 0
# 4 4 4 4 5 5 5 6 6 6 4 0 0 3 0 2
# 5 4 4 4 5 5 5 6 6 6 5 15 3 1 0 2
Kind of a dirty solution, but it does what you want. It combines two map_dfc calls.
library(dplyr)
library(purrr)
df <- tibble(id = c(1:5),
             a = c(3,10,4,0,15),
             b = c(2,1,1,0,3),
             c = c(12,3,0,3,1),
             d = c(9,7,8,0,0),
             e = c(1,2,0,2,2))
create_postfix_cols <- function(df, country, n) {
  # df = a dataframe
  # country = prefix value (e.g. "canada")
  # n = vector of postfix values (e.g. 3:5)
  map2_dfc(.x = rep(country, length(n)),
           .y = n,
           ~ tibble(col = rep(1 + .y, nrow(df))) %>%
             set_names(paste(.x, .y, sep = "_")))
}
countries <- c("usa", "canada", "nz")
n <- 3:5
df %>%
  bind_cols(map_dfc(.x = countries, ~ create_postfix_cols(df, .x, n)))
# A tibble: 5 x 15
id a b c d e usa_3 usa_4 usa_5 canada_3 canada_4 canada_5
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 3 2 12 9 1 4 5 6 4 5 6
2 2 10 1 3 7 2 4 5 6 4 5 6
3 3 4 1 0 8 0 4 5 6 4 5 6
4 4 0 0 3 0 2 4 5 6 4 5 6
5 5 15 3 1 0 2 4 5 6 4 5 6
# ... with 3 more variables: nz_3 <dbl>, nz_4 <dbl>, nz_5 <dbl>
Here is a base R solution. You can rearrange the columns if you would like, but this should get you started:
# Create column names using an index and country names
idx <- 3:5
countries <- c("usa", "canada", "nz")
new_columns <- unlist(lapply(countries, paste0, "_", idx))
# Adding new values using index & taking advantage of recycling
df[new_columns] <- sort(rep(1+idx, nrow(df)))
df
id a b c d e usa_3 usa_4 usa_5 canada_3 canada_4 canada_5 nz_3 nz_4 nz_5
1 1 3 2 12 9 1 4 5 6 4 5 6 4 5 6
2 2 10 1 3 7 2 4 5 6 4 5 6 4 5 6
3 3 4 1 0 8 0 4 5 6 4 5 6 4 5 6
4 4 0 0 3 0 2 4 5 6 4 5 6 4 5 6
5 5 15 3 1 0 2 4 5 6 4 5 6 4 5 6
Or, if you prefer:
# All in one long line
df[unlist(lapply(countries, paste0, "_", idx))] <- sort(rep(1+idx, nrow(df)))
Using the new grammar in dplyr 0.8.0, where list() replaces funs(), I want to be able to create new variables with mutate_at() without overwriting the old ones. Basically, I need to replace any integers over a given value with NA in several columns, without overwriting those columns.
I had this working already using a previous version of dplyr, but I want to accommodate the changes in dplyr so my code doesn't break later.
Say I have a tibble:
x <- tibble(id = 1:10, x = sample(1:10, 10, replace = TRUE),
y = sample(1:10, 10, replace = TRUE))
I want to be able to replace any values above 5 with NA. I used to do it this way, and this result is exactly what I want:
x %>% mutate_at(vars(x, y), funs(RC = replace(., which(. > 5), NA)))
# A tibble: 10 x 5
id x y x_RC y_RC
<int> <int> <int> <int> <int>
1 1 2 3 2 3
2 2 2 1 2 1
3 3 3 4 3 4
4 4 4 4 4 4
5 5 2 9 2 NA
6 6 6 8 NA NA
7 7 10 2 NA 2
8 8 1 3 1 3
9 9 10 1 NA 1
10 10 1 8 1 NA
This is what I've tried, but it doesn't work:
x %>% mutate_at(vars(x, y), list(RC = replace(., which(. > 5), NA)))
Error in `[<-.data.frame`(`*tmp*`, list, value = NA) :
  new columns would leave holes after existing columns
This works, but replaces the original variables:
x %>% mutate_at(vars(x, y), list(~replace(., which(. > 5), NA)))
# A tibble: 10 x 3
id x y
<int> <int> <int>
1 1 2 3
2 2 2 1
3 3 3 4
4 4 4 4
5 5 2 NA
6 6 NA NA
7 7 NA 2
8 8 1 3
9 9 NA 1
10 10 1 NA
Any help is appreciated!
Almost there; you just need a named list of formulas (note the ~):
x %>% mutate_at(vars(x, y), list(RC = ~replace(., which(. > 5), NA)))
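For what it's worth, on a newer dplyr where across() is available (it supersedes the _at verbs), a roughly equivalent sketch is:
x %>%
  mutate(across(c(x, y), ~ replace(., which(. > 5), NA), .names = "{.col}_RC"))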
I am new to R and I am having trouble finding a way to restructure my data.
Currently I have 365 different data frames, each representing a day of the year. Each data frame contains point-of-sale data: how many items of a product were sold at each store on that day. There are four columns in each data frame, namely ShopId, ArticleId, Date (which is constant within one data frame), and AmountSold.
Now I want to restructure my data frames to be able to forecast how many items I need per product per store per day. I would like to do this by collecting either all data points corresponding to a certain store (ShopId), or all data points corresponding to a certain product (ArticleId), into separate data frames. The problem is that I do not know how to do this. I already have a list of all my data frames, which I built like this:
l.df <- lapply(ls(), function(x) if (class(get(x)) == "data.frame") get(x))
I also have a list of all ArticleIds occurring in the data frames (AllArticleId) and a list of all ShopIds occurring in the data frames (AllShopId).
Can anyone tell me how I can restructure my data?
Perhaps you can try the following:
library(dplyr);
file_names <- dir() # Location of individual sales files
agg_df <- do.call(rbind,lapply(file_names,read.csv))
# Median sale per Article ID
agg_df = agg_df %>%
  group_by(ArticleID) %>%
  mutate(mSlByArtId = median(AmountSold));
# Median sale per Shop ID
agg_df = agg_df %>%
  group_by(ShopID) %>%
  mutate(mSlByShpId = median(AmountSold));
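If the goal is literally one data frame per shop (or per article), as described in the question, base split() on the id column gives a named list of frames that you can then iterate over, for example:
per_shop <- split(agg_df, agg_df$ShopID)        # named list: one data frame per shop
per_article <- split(agg_df, agg_df$ArticleID)  # named list: one data frame per article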
Here is, for example, how you can compute a per-shop mean, grouping by shopId:
dl <- list()
dl[[1]] <- data.frame(
  shopId = rep(1:4, each = 2),
  ArticleId = c(1, 1, 3, 2, 3, 2, 1, 2),
  date = 1:8,
  AmountSoled = 5
)
dl[[2]] <- data.frame(
  shopId = rep(1:4, each = 2),
  ArticleId = c(2, 1, 3, 2, 4, 4, 3, 1),
  date = 1:8,
  AmountSoled = 5
)
# dl
# [[1]]
# shopId ArticleId date AmountSoled
# 1 1 1 1 5
# 2 1 1 2 5
# 3 2 3 3 5
# 4 2 2 4 5
# 5 3 3 5 5
# 6 3 2 6 5
# 7 4 1 7 5
# 8 4 2 8 5
#
# [[2]]
# shopId ArticleId date AmountSoled
# 1 1 2 1 5
# 2 1 1 2 5
# 3 2 3 3 5
# 4 2 2 4 5
# 5 3 4 5 5
# 6 3 4 6 5
# 7 4 3 7 5
# 8 4 1 8 5
df <- do.call(rbind, dl)
df
# shopId ArticleId date AmountSoled
# 1 1 1 1 5
# 2 1 1 2 5
# 3 2 3 3 5
# 4 2 2 4 5
# 5 3 3 5 5
# 6 3 2 6 5
# 7 4 1 7 5
# 8 4 2 8 5
# 9 1 2 1 5
# 10 1 1 2 5
# 11 2 3 3 5
# 12 2 2 4 5
# 13 3 4 5 5
# 14 3 4 6 5
# 15 4 3 7 5
# 16 4 1 8 5
aggregate(df, by = list(df$shopId), mean)
# Group.1 shopId ArticleId date AmountSoled
# 1 1 1 1.25 1.5 5
# 2 2 2 2.50 3.5 5
# 3 3 3 3.25 5.5 5
# 4 4 4 1.75 7.5 5
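A side note on that last call: averaging ArticleId and date rarely means anything, so if only the sales column is of interest, the formula interface keeps the output focused on it:
aggregate(AmountSoled ~ shopId, data = df, FUN = mean)
#   shopId AmountSoled
# 1      1           5
# 2      2           5
# 3      3           5
# 4      4           5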
The following code separately produces the group means of x and y according to group. Suppose that I have a number of variables for which I need to repeat the same operation.
How would you suggest proceeding in order to obtain the same result with a single command? (I suppose it is necessary to use tapply, but I am not really sure about it.)
x=seq(1,11,by=2); y=seq(2,12,by=2); group=rep(1:2, each=3)
dat <- data.frame(cbind(group, x, y))
dat$m_x <- ave(dat$x, dat$group)
dat$m_y <- ave(dat$y, dat$group)
dat
Many thanks.
Alternative solutions using data.table and plyr packages:
1) Using data.table
require(data.table)
dt <- data.table(dat, key="group")
# Following #Matthew's comment, edited:
dt[, `:=`(m_x = mean(x), m_y = mean(y)), by=group]
Output:
group x y m_x m_y
1: 1 1 2 3 4
2: 1 3 4 3 4
3: 1 5 6 3 4
4: 2 7 8 9 10
5: 2 9 10 9 10
6: 2 11 12 9 10
2) using plyr and transform:
require(plyr)
ddply(dat, .(group), transform, m_x=mean(x), m_y=mean(y))
output:
group x y m_x m_y
1 1 1 2 3 4
2 1 3 4 3 4
3 1 5 6 3 4
4 2 7 8 9 10
5 2 9 10 9 10
6 2 11 12 9 10
3) using plyr and numcolwise (note the reduced output):
ddply(dat, .(group), numcolwise(mean))
Output:
group x y
1 1 3 4
2 2 9 10
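For completeness, the same mutate-style result can be written in dplyr (not used elsewhere in this answer, so treat it as an optional alternative):
library(dplyr)
dat %>%
  group_by(group) %>%
  mutate(m_x = mean(x), m_y = mean(y)) %>%
  ungroup()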
Assuming you have more than just two columns, you would want to use apply to apply ave to every column in the matrix.
x=seq(1,11,by=2); y=seq(2,12,by=2); group=rep(1:2, each=3)
dat <- cbind(x, y)
ave.dat <- apply(dat, 2, function(column) ave(column, group))
ave.dat
#      x  y
# [1,] 3  4
# [2,] 3  4
# [3,] 3  4
# [4,] 9 10
# [5,] 9 10
# [6,] 9 10
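Note that apply() here works on the x/y matrix only, so the group column is gone; if you want it back next to the means, you can simply bind it on:
cbind(group, ave.dat)
#      group x  y
# [1,]     1 3  4
# [2,]     1 3  4
# [3,]     1 3  4
# [4,]     2 9 10
# [5,]     2 9 10
# [6,]     2 9 10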
You can also use aggregate():
dat2 <- data.frame(dat, aggregate(dat[,-1], by=list(dat$group), mean)[group, -1])
dat2
group x y x.1 y.1
1 1 1 2 3 4
1.1 1 3 4 3 4
1.2 1 5 6 3 4
2 2 7 8 9 10
2.1 2 9 10 9 10
2.2 2 11 12 9 10
row.names(dat2) <- rownames(dat)
colnames(dat2) <- gsub("(.)\\.1", "m_\\1", colnames(dat2))
dat2
group x y m_x m_y
1 1 1 2 3 4
2 1 3 4 3 4
3 1 5 6 3 4
4 2 7 8 9 10
5 2 9 10 9 10
6 2 11 12 9 10
If the variable names are more than a single character, you would need to modify the gsub() call.
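For example, replacing the pattern above with one that captures a name of any length before the .1 suffix:
colnames(dat2) <- gsub("(.+)\\.1$", "m_\\1", colnames(dat2))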