I have trouble applying a split to a data.frame and then assembling some aggregated results back into a different data.frame. I tried using the 'unsplit' function, but I can't figure out how to use it properly to get the desired result. Let me demonstrate on the common 'mtcars' data. Say the result I ultimately want is a data frame with two variables: cyl (cylinders) and mean_mpg (the mean of mpg over the group of cars sharing the same number of cylinders).
So the initial split goes like this:
spl <- split(mtcars, mtcars$cyl)
The result of which looks something like this:
$`4`
mpg cyl disp hp drat wt qsec vs am gear carb
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
...
$`6`
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
...
$`8`
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
...
Now I want to do something along the lines of:
df <- as.data.frame(lapply(spl, function(x) mean(x$mpg)), col.names=c("cyl", "mean_mpg"))
However, doing the above results in:
X4 X6 X8
1 26.66364 19.74286 15.1
While I'd want the df to be like this:
cyl mean_mpg
1 4 26.66364
2 6 19.74286
3 8 15.10000
Thanks, J.
If you are only interested in reassembling a split, then look at (2), (4) and (4a). If the actual underlying question is really about how to perform aggregations over groups, then they may all be of interest:
1) aggregate Normally one uses aggregate, as already mentioned in the comments. Simplifying @alistaire's code slightly:
aggregate(mpg ~ cyl, mtcars, mean)
2) split/lapply/do.call Also, @rawr has given a split/lapply/do.call solution in the comments, which we can also simplify slightly:
spl <- split(mtcars, mtcars$cyl)
do.call("rbind", lapply(spl, with, data.frame(cyl = cyl[1], mpg = mean(mpg))))
3) do.call/by The last one could alternately be rewritten in terms of by:
do.call("rbind", by(mtcars, mtcars$cyl, with, data.frame(cyl = cyl[1], mpg = mean(mpg))))
4) split/lapply/unsplit Another possibility is to use split and unsplit:
spl <- split(mtcars, mtcars$cyl)
L <- lapply(spl, with, data.frame(cyl = cyl[1], mpg = mean(mpg), row.names = cyl[1]))
unsplit(L, sapply(L, "[[", "cyl"))
4a) or if row names are sufficient:
spl <- split(mtcars, mtcars$cyl)
L <- lapply(spl, with, data.frame(mpg = mean(mpg), row.names = cyl[1]))
unsplit(L, sapply(L, rownames))
The above do not use any packages but there are also many packages that can do aggregations including dplyr, data.table and sqldf:
5) dplyr
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarize(mpg = mean(mpg)) %>%
ungroup()
6) data.table
library(data.table)
as.data.table(mtcars)[, list(mpg = mean(mpg)), by = "cyl"]
7) sqldf
library(sqldf)
sqldf("select cyl, avg(mpg) mpg from mtcars group by cyl")
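For the exact layout asked for in the question (columns named cyl and mean_mpg), here is a minimal base R sketch. It assumes the group labels produced by split() can be converted back to numbers, which holds for cyl; the names means and df are only illustrative.

```r
# Split by cylinder count, then compute one mean per group
spl <- split(mtcars, mtcars$cyl)
means <- sapply(spl, function(x) mean(x$mpg))  # named numeric vector, names are the group labels

# Assemble a two-column data frame; the group labels come back from the names
df <- data.frame(
  cyl = as.numeric(names(means)),
  mean_mpg = unname(means)
)
df
```

The conversion with as.numeric() would not be appropriate for a grouping variable whose labels are not numeric; in that case the names could be kept as character.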
I have a dataframe that must have a specific layout. Is there a way for me to make R reject any command I attempt that would change the number or names of the columns?
It is easy to check the format of the data table manually, but I have found no way to make R do it for me automatically every time I execute a piece of code.
regards
This doesn’t offer the level of foolproof safety I think you’re looking for (hard to know without more details), but you could define a function operator that yields modified functions that error if changes to columns are detected:
same_cols <- function(fn) {
function(.data, ...) {
out <- fn(.data, ...)
stopifnot(identical(sort(names(.data)), sort(names(out))))
out
}
}
For example, you could create modified versions of dplyr functions:
library(dplyr)
my_mutate <- same_cols(mutate)
my_summarize <- same_cols(summarize)
which work as usual if columns are preserved:
mtcars %>%
my_mutate(mpg = mpg / 2) %>%
head()
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 10.50 6 160 110 3.90 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 10.50 6 160 110 3.90 2.875 17.02 0 1 4 4
# Datsun 710 11.40 4 108 93 3.85 2.320 18.61 1 1 4 1
# Hornet 4 Drive 10.70 6 258 110 3.08 3.215 19.44 1 0 3 1
# Hornet Sportabout 9.35 8 360 175 3.15 3.440 17.02 0 0 3 2
# Valiant 9.05 6 225 105 2.76 3.460 20.22 1 0 3 1
mtcars %>%
my_summarize(across(everything(), mean))
# mpg cyl disp hp drat wt qsec vs am
# 1 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625
# gear carb
# 1 3.6875 2.8125
But throw errors if changes to columns are made:
mtcars %>%
my_mutate(mpg2 = mpg / 2)
# Error in my_mutate(., mpg2 = mpg/2) :
# identical(sort(names(.data)), sort(names(out))) is not TRUE
mtcars %>%
my_summarize(mpg = mean(mpg))
# Error in my_summarize(., mpg = mean(mpg)) :
# identical(sort(names(.data)), sort(names(out))) is not TRUE
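The same wrapper is not tied to dplyr. As a rough sketch, it can be applied to base R's transform(), with tryCatch() used only to show the failure without stopping the script. The names my_transform, ok and bad are illustrative and not part of the answer above.

```r
# Function operator from the answer, repeated so the sketch is self-contained
same_cols <- function(fn) {
  function(.data, ...) {
    out <- fn(.data, ...)
    stopifnot(identical(sort(names(.data)), sort(names(out))))
    out
  }
}

my_transform <- same_cols(transform)

# Column set preserved: returns the modified data frame
ok <- my_transform(mtcars, mpg = mpg / 2)

# Column added: the wrapper signals an error, captured here as a condition object
bad <- tryCatch(my_transform(mtcars, mpg2 = mpg / 2), error = function(e) e)
inherits(bad, "error")
```

Because the check sorts the names, it accepts a reordering of columns; dropping the two sort() calls would make it order-sensitive as well.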
You mention that the names and columns need to stay the same. Also be aware that with data.table, names are updated by reference, so a variable that merely holds names(foo) changes along with the table. See the example below.
library(data.table)
foo <- data.table(
x = letters[1:5],
y = LETTERS[1:5]
)
colnames <- names(foo)
colnames
# [1] "x" "y"
setnames(foo, colnames, c("a", "b"))
foo[, z := "oops"]
colnames
# [1] "a" "b" "z"
identical(colnames, names(foo))
# [1] TRUE
To check that both the columns and the names are unaltered (and, here, in the same order), take a copy of the names right away. Then, after each code run, compare the current names with the copied names.
foo <- data.table(
x = letters[1:5],
y = LETTERS[1:5]
)
colnames <- copy(names(foo))
setnames(foo, colnames, c("a", "b"))
foo[, z := "oops"]
identical(colnames, names(foo))
# [1] FALSE
colnames
# [1] "x" "y"
names(foo)
# [1] "a" "b" "z"
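The manual comparison could be wrapped in a small helper that is called after each step. This is only a sketch in base R: assert_layout is a hypothetical name, and it checks both the names and their order. With a data.table, the expected names should still be taken with copy(names(foo)), as in the example above.

```r
# Hypothetical helper: stop with a readable message if the layout changed
assert_layout <- function(data, expected) {
  if (!identical(names(data), expected)) {
    stop("column layout changed: expected ",
         paste(expected, collapse = ", "),
         " but found ", paste(names(data), collapse = ", "))
  }
  invisible(data)  # return the data so the call can sit inside a pipeline
}

foo <- data.frame(x = letters[1:5], y = LETTERS[1:5])
expected <- names(foo)         # snapshot taken right after creation

assert_layout(foo, expected)   # passes silently
foo$z <- "oops"
res <- tryCatch(assert_layout(foo, expected), error = function(e) conditionMessage(e))
res
```

This does not prevent a change from happening; it only makes the change fail loudly at the next check.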
I need to rename multiple variables using a replacement dataframe. This replacement dataframe also includes regex. I would like to use a solution similar to the one proposed here, e.g.
df %>% rename_with(~ newnames, all_of(oldnames))
MWE:
df <- mtcars[, 1:5]
# works without regex
replace_df_1 <- tibble::tibble(
old = df %>% colnames(),
new = df %>% colnames() %>% toupper()
)
df %>% rename_with(~ replace_df_1$new, all_of(replace_df_1$old))
# with regex
replace_df_2 <- tibble::tibble(
old = c("^m", "cyl101|cyl", "disp", "hp", "drat"),
new = df %>% colnames() %>% toupper()
)
old new
<chr> <chr>
1 ^m MPG
2 cyl101|cyl CYL
3 disp DISP
4 hp HP
5 drat DRAT
# does not work
df %>% rename_with(~ replace_df_2$new, all_of(replace_df_2$old))
df %>% rename_with(~ matches(replace_df_2$new), all_of(replace_df_2$old))
EDIT 1:
The solution of @Mael works in general, but there seems to be an index issue. Consider, for example, the following:
replace_df_2 <- tibble::tibble(
old = c("xxxx", "cyl101|cyl", "yyy", "xxx", "yyy"),
new = mtcars[,1:5] %>% colnames() %>% toupper()
)
mtcars[, 1:5] %>%
rename_with(~ replace_df_2$new, matches(replace_df_2$old))
Results in
mpg MPG disp hp drat
<dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9
meaning that the rename_with function correctly finds the column, but replaces it with the first item in the replacement column. How can we tell the function to take the respective row where a replacement has been found?
So in this example (edit 1), I only want to substitute the second column with "CYL", the rest should be left untouched. The problem is that the function takes the first replacement (MPG) instead of the second (CYL).
Thank you for any hints!
matches should be on the regex-y column:
df %>%
rename_with(~ replace_df_2$new, matches(replace_df_2$old))
MPG CYL DISP HP DRAT
Mazda RX4 21.0 6 160.0 110 3.90
Mazda RX4 Wag 21.0 6 160.0 110 3.90
Datsun 710 22.8 4 108.0 93 3.85
Hornet 4 Drive 21.4 6 258.0 110 3.08
Hornet Sportabout 18.7 8 360.0 175 3.15
Valiant 18.1 6 225.0 105 2.76
#...
If the task is simply to set all col names to upper-case, then this works:
sub("^(.+)$", "\\U\\1", colnames(df), perl = TRUE)
[1] "MPG" "CYL" "DISP" "HP" "DRAT"
In dplyr:
df %>%
rename_with( ~sub("^(.+)$", "\\U\\1", colnames(df), perl = TRUE))
I found a solution using the idea of non-standard evaluation from this question and @Maël's answer.
Using map_lgl we create a logical vector that is TRUE for each pattern in replace_df_2$old that matches at least one column name of df. Then we use this logical vector to subset replace_df_2$new and get the correct replacement.
library(tidyverse)  # provides dplyr, purrr (map_lgl) and stringr (str_detect)
df <- mtcars[, 1:5]
df %>%
rename_with(
# str_detect(string, pattern): test each regex in `old` against the column names
.fn = ~ replace_df_2$new[map_lgl(replace_df_2$old, ~ any(str_detect(names(df), .x)))],
.cols = matches(replace_df_2$old))
Result:
mpg CYL disp hp drat
Mazda RX4 21.0 6 160.0 110 3.90
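An alternative that sidesteps the index alignment problem is to apply the replacement table row by row. The sketch below uses only base R; rename_by_regex is a hypothetical helper, and it matches every pattern against the original column names so that an earlier rename cannot affect a later match. If two patterns hit the same column, the later row wins.

```r
# Hypothetical helper: for each row of `map`, rename the columns whose
# original name matches the regex in map$old to the value in map$new
rename_by_regex <- function(data, map) {
  original <- names(data)
  for (i in seq_len(nrow(map))) {
    hit <- grepl(map$old[i], original)
    names(data)[hit] <- map$new[i]
  }
  data
}

replace_df_2 <- data.frame(
  old = c("xxxx", "cyl101|cyl", "yyy", "xxx", "yyy"),
  new = toupper(names(mtcars)[1:5]),
  stringsAsFactors = FALSE
)

names(rename_by_regex(mtcars[, 1:5], replace_df_2))
```

With the replacement table from EDIT 1, only the cyl column matches a pattern, so it is the only one renamed.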
Suppose I am working with the mtcars dataset, and I would like to add:
1 to all values in the column: mpg
2 to all values in the column: cyl
3 to all values in the column: disp
I would like to keep all columns in mtcars, and refer to the columns by their names rather than their index.
Here's my current attempt:
library("tidyverse")
library("rlang")
data(mtcars)
mtcars_colnames <- quo(c("mpg", "cyl", "disp"))
num <- c(1, 2, 3)
mtcars %>% mutate(across(!!! mtcars_colnames, function(x) {x + num[col(.)]}))
I'm stuck on how to dynamically add (1,2,3) to columns (mpg, cyl, disp).
Thanks in advance.
We could change the input: pass a plain vector of strings instead of quosures, and a named vector for 'num'. Then use cur_column() inside across() to look up the corresponding value in the named 'num' vector and do the addition:
library(dplyr)
mtcars_colnames <- c("mpg", "cyl", "disp")
num <- setNames(c(1, 2, 3), mtcars_colnames)
mtcars1 <- mtcars %>%
mutate(across(all_of(mtcars_colnames), ~ num[cur_column()] + .))
Check the output:
# // old data
mtcars %>%
select(all_of(mtcars_colnames)) %>%
slice_head(n = 5)
# mpg cyl disp
#Mazda RX4 21.0 6 160
#Mazda RX4 Wag 21.0 6 160
#Datsun 710 22.8 4 108
#Hornet 4 Drive 21.4 6 258
#Hornet Sportabout 18.7 8 360
# // new data
mtcars1 %>%
select(all_of(mtcars_colnames)) %>%
slice_head(n = 5)
# mpg cyl disp
#Mazda RX4 22.0 8 163
#Mazda RX4 Wag 22.0 8 163
#Datsun 710 23.8 6 111
#Hornet 4 Drive 22.4 8 261
#Hornet Sportabout 19.7 10 363
Or, if we prefer to pass an unnamed 'num' vector, match cur_column() against 'mtcars_colnames' inside across() to get the index, and then use that index to subset 'num':
mtcars1 <- mtcars %>%
mutate(across(all_of(mtcars_colnames),
~ num[match(cur_column(), mtcars_colnames)] + .))
Here are 3 base R approaches:
mtcars_colnames <- c("mpg", "cyl", "disp")
num <- c(1, 2, 3)
df <- mtcars
#option 1
df[mtcars_colnames] <- sweep(df[mtcars_colnames], 2, num, `+`)
#option 2
df[mtcars_colnames] <- Map(`+`, df[mtcars_colnames], num)
#option 3
df[mtcars_colnames] <- t(t(df[mtcars_colnames]) + num)
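As a quick sanity check, the three options can be computed side by side and compared. This sketch keeps each result in its own object (opt1 to opt3, names chosen here for illustration). Option 3 works because t() turns the three columns into rows, so the length-3 vector num is recycled down each column of the transposed matrix.

```r
cols <- c("mpg", "cyl", "disp")
num <- c(1, 2, 3)
x <- mtcars[cols]

opt1 <- sweep(x, 2, num, `+`)   # add num[j] to column j
opt2 <- Map(`+`, x, num)        # pair each column with its number
opt3 <- t(t(x) + num)           # transpose, recycle num, transpose back

# Compare the plain numeric values, ignoring class and dimnames
m1 <- unname(as.matrix(opt1))
m2 <- unname(do.call(cbind, opt2))
m3 <- unname(opt3)
c(isTRUE(all.equal(m1, m2)), isTRUE(all.equal(m1, m3)))
```

The options differ in what they return (a data frame, a list and a matrix respectively), which is why the comparison is done on bare matrices.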
Is it possible to use the pipe operator in R not to get data, but to set it?
Let's say I want to modify the first row of the mtcars dataset and set the value of qsec to 99.
Traditional way:
mtcars[1, 7] <- 99
Is that also possible using the pipe operator?
mtcars %>% filter(qsec == 16.46) %>% select(qsec) <- 99
If a chain is absolutely necessary, or if we are just curious whether <- can be applied in a chain:
library(magrittr)
mtcars %>%
`[<-`(1, 7, 99) %>%
head(2)
# mpg cyl disp hp drat wt qsec vs am gear carb
#Mazda RX4 21 6 160 110 3.9 2.620 99.00 0 1 4 4
#Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
Also, inset from magrittr (mentioned in the comments) is an alias for [<-:
mtcars %>%
inset(1, 7, 99) %>%
head(2)
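If the backtick call reads awkwardly, a tiny wrapper does the same job and lets the column be given by name. This is a sketch only: set_cell is a hypothetical helper, and the native pipe |> assumes R 4.1 or later (with magrittr loaded, %>% can be used in the same way).

```r
# Hypothetical helper: assign one cell and return the whole data frame,
# which is what makes it usable as a pipeline step
set_cell <- function(data, i, j, value) {
  data[i, j] <- value
  data
}

res <- mtcars |>
  set_cell(1, "qsec", 99) |>
  head(2)
res$qsec
```

As with the other piped versions, mtcars itself is left unchanged; the modified copy has to be assigned to keep it.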
I have several similar data frames with many columns in common. I would like to select and rename a subset of those columns from any table.
library(tidyverse)
mtcars %>%
select(my_mpg = mpg,
cylinders = cyl,
gear)
Is it possible to do something like
my_select_rename <- c("my_mpg"="mpg","cylinders"="cyl","gear")
mtcars %>%
select_(.dots = my_select_rename)
but using the tidyeval framework instead?
I think you want:
my_select <- c("mpg","cyl","gear")
my_select_rename <- c("my_mpg","cylinders","gear")
mtcars %>%
select_at(vars(my_select)) %>%
setNames(., my_select_rename)
my_mpg cylinders gear
Mazda RX4 21.0 6 4
Mazda RX4 Wag 21.0 6 4
Datsun 710 22.8 4 4
Hornet 4 Drive 21.4 6 3
Hornet Sportabout 18.7 8 3
lionel's answer to the question "group_by by a vector of characters using tidy evaluation semantics" provides the answer:
mtcars %>%
select(!!! rlang::syms(my_select_rename))
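In current dplyr, select_() is deprecated and select_at() has been superseded. A sketch of the same idea with tidyselect, assuming dplyr 1.0.0 or later: a named character vector passed to all_of() selects the columns and renames them to the vector's names. The third element is given an explicit name here, which differs slightly from the vector in the question.

```r
library(dplyr)

# Names are the new column names, values are the existing ones
my_select_rename <- c(my_mpg = "mpg", cylinders = "cyl", gear = "gear")

res <- mtcars %>%
  select(all_of(my_select_rename))
names(res)
```

Because the lookup is an ordinary character vector, the same my_select_rename can be reused across several data frames that share these columns; all_of() raises an error if one of the columns is missing.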