An example:
a = c(10,20,30)
b = c(1,2,3)
c = c(4,5,6)
d = c(7,8,9)
df=data.frame(a,b,c,d)
library(dplyr)
df_1 = df %>% mutate(a1=sum(a+1))
How do I add "a1" after "a" (or any other defined position) and NOT at the end?
Thank you.
An update that might be useful for others who find this question - this can now be achieved directly within mutate (I'm using dplyr v1.0.2).
Just specify which existing column the new column should be positioned after or before, e.g.:
df_after <- df %>%
mutate(a1=sum(a+1), .after = a)
df_before <- df %>%
mutate(a1=sum(a+1), .before = b)
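To check where the new column lands, here is a minimal sketch. It rebuilds the question's df so it runs on its own and assumes a dplyr version that supports .before/.after in mutate (the answer above used v1.0.2):

```r
library(dplyr)

# Rebuild the data frame from the question
df <- data.frame(a = c(10, 20, 30), b = c(1, 2, 3),
                 c = c(4, 5, 6), d = c(7, 8, 9))

# Place a1 directly after a, or equivalently directly before b
df_after  <- df %>% mutate(a1 = sum(a + 1), .after = a)
df_before <- df %>% mutate(a1 = sum(a + 1), .before = b)

names(df_after)
#> [1] "a"  "a1" "b"  "c"  "d"
identical(names(df_after), names(df_before))
#> [1] TRUE
```

Both arguments take tidyselect expressions, so a position such as .after = last_col() should also work.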
Another option is add_column from tibble
library(tibble)
add_column(df, a1 = sum(a + 1), .after = "a")
# a a1 b c d
#1 10 63 1 4 7
#2 20 63 2 5 8
#3 30 63 3 6 9
Extending on www's answer, we can use dplyr's select_helper functions to reorder newly created columns as we see fit:
library(dplyr)
## add a1 after a
df %>%
mutate(a1 = sum(a + 1)) %>%
select(a, a1, everything())
#> a a1 b c d
#> 1 10 63 1 4 7
#> 2 20 63 2 5 8
#> 3 30 63 3 6 9
## add a1 after c
df %>%
mutate(a1 = sum(a + 1)) %>%
select(1:c, a1, everything())
#> a b c a1 d
#> 1 10 1 4 63 7
#> 2 20 2 5 63 8
#> 3 30 3 6 63 9
dplyr >= 1.0.0
relocate was added as a new verb to change the order of one or more columns. If you pipe the output of your mutate into relocate, it accepts the same .before and .after arguments:
df_1 %>%
relocate(a1, .after = a)
a a1 b c d
1 10 63 1 4 7
2 20 63 2 5 8
3 30 63 3 6 9
An additional benefit is you can also move multiple columns using any tidyselect syntax:
df_1 %>%
relocate(c:a1, .before = b)
a c d a1 b
1 10 4 7 63 1
2 20 5 8 63 2
3 30 6 9 63 3
By default, the mutate function adds newly created columns at the end. However, we can sort the columns alphabetically after the mutate call using select.
library(dplyr)
df_1 <- df %>%
mutate(a1 = sum(a + 1)) %>%
select(sort(names(.)))
df_1
# a a1 b c d
# 1 10 63 1 4 7
# 2 20 63 2 5 8
# 3 30 63 3 6 9
Related
I looked for solutions here: Multiply columns in a data frame by a vector and here: What is the right way to multiply data frame by vector?, but it doesn't really work.
What I want to do is a more or less clean tidyverse way where I multiply columns by a vector and then add these as new columns to the existing data frame. Taking the data example from the first link:
c1 <- c(1,2,3)
c2 <- c(4,5,6)
c3 <- c(7,8,9)
d1 <- data.frame(c1,c2,c3)
c1 c2 c3
1 1 4 7
2 2 5 8
3 3 6 9
v1 <- c(1,2,3)
my desired result would be:
c1 c2 c3 pro_c1 pro_c2 pro_c3
1 1 4 7 1 8 21
2 2 5 8 2 10 24
3 3 6 9 3 12 27
I tried:
library(tidyverse)
d1 |>
mutate(pro = sweep(across(everything()), 2, v1, "*"))
But here the problem is the new columns are actually a data frame within my data frame. And I'm struggling with turning this data frame-in-data frame into regular columns. I assume, I could probably first setNames on this inner data frame and then unnest, but wondering if there's a more direct way by looping over each column with across and feed it with the first/second/third element of v1?
(I know I could probably also first create a standalone data frame with the three new multiplied columns, give them a unique name and then bind_cols on both, d1 and the df with the products.)
This is perhaps ridiculous, but you could use
library(dplyr)
d1 %>%
mutate(across(everything(),
~.x * v1[which(names(d1) == cur_column())],
.names = "pro_{.col}"))
which returns
c1 c2 c3 pro_c1 pro_c2 pro_c3
1 1 4 7 1 8 21
2 2 5 8 2 10 24
3 3 6 9 3 12 27
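A slightly less roundabout variant of the same idea is to name the multipliers after the columns, so that cur_column() can look them up directly. This is only a sketch; the name multipliers is made up for illustration and is not part of the original answer:

```r
library(dplyr)

d1 <- data.frame(c1 = c(1, 2, 3), c2 = c(4, 5, 6), c3 = c(7, 8, 9))
v1 <- c(1, 2, 3)

# Hypothetical helper: one multiplier per column, named after that column
multipliers <- setNames(v1, names(d1))

res <- d1 %>%
  mutate(across(everything(),
                ~ .x * multipliers[[cur_column()]],  # look up by column name
                .names = "pro_{.col}"))
res
```

Looking up by name instead of by position means the result does not depend on the column order of d1, as long as every column has a matching entry in multipliers.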
Just for the fun part, I trialed & errored a bit more after seeing some of your solutions. Since I started treating myself to the pain of using the base R native pipe which doesn't yet allow for passing a "." argument silently as the first argument, I had to fiddle around with it a bit more:
library(tidyverse)
d1 |>
(\(x)(bind_cols(x, x |>
map2_dfc(v1, `*`) |>
rename_with(.cols = everything(),
.fn = ~paste0("pro_", .)))))()
c1 c2 c3 pro_c1 pro_c2 pro_c3
1 1 4 7 1 8 21
2 2 5 8 2 10 24
3 3 6 9 3 12 27
Found an even easier solution:
d1 |>
add_column(d1 |>
map2_dfc(v1, `*`) |>
rename_with(.cols = everything(),
.fn = ~paste0("pro_", .)))
If it is by row, then one option is c_across
library(dplyr)
library(stringr)
library(tibble)
new <- as_tibble(setNames(as.list(v1), names(d1)))
d1 %>%
rowwise %>%
mutate(c_across(everything()) * new) %>%
rename_with(~ str_c("pro_", .x), everything()) %>%
bind_cols(d1, .)
-output
c1 c2 c3 pro_c1 pro_c2 pro_c3
1 1 4 7 1 8 21
2 2 5 8 2 10 24
3 3 6 9 3 12 27
Or another option is map2
library(purrr)
map2_dfc(d1, v1, `*`) %>%
rename_with(~ str_c("pro_", .x), everything()) %>%
bind_cols(d1, .)
-output
c1 c2 c3 pro_c1 pro_c2 pro_c3
1 1 4 7 1 8 21
2 2 5 8 2 10 24
3 3 6 9 3 12 27
Also, with the OP's approach, the result is a data.frame column. It can be unpacked:
library(tidyr)
d1 |>
mutate(pro = sweep(cur_data(), 2, v1, `*`)) |>
unpack(pro, names_sep = "_")
-output
# A tibble: 3 × 6
c1 c2 c3 pro_c1 pro_c2 pro_c3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 4 7 1 8 21
2 2 5 8 2 10 24
3 3 6 9 3 12 27
EDIT: Added names_sep based on @deschen's comments.
Here is a dplyr-ized version of the usual apply(. , 1, fun) paradigm:
d1 %>% apply(1, "*", v1) %>% t %>% cbind(d1, .)
c1 c2 c3 c1 c2 c3
1 1 4 7 1 8 21
2 2 5 8 2 10 24
3 3 6 9 3 12 27
It gets a bit hackish if you want to assign column names to the matrix before binding back to the starting dataframe:
d1 %>% apply(1, "*", v1) %>% t %>% `colnames<-`(., paste0("pro_", colnames(.))) %>% cbind(d1, .)
c1 c2 c3 pro_c1 pro_c2 pro_c3
1 1 4 7 1 8 21
2 2 5 8 2 10 24
3 3 6 9 3 12 27
Similar to @IRTFM's solution, but does not need apply(...):
cbind(d1, t(t(d1)*v1))
## c1 c2 c3 c1 c2 c3
## 1 1 4 7 1 8 21
## 2 2 5 8 2 10 24
## 3 3 6 9 3 12 27
Or,
result <- cbind(d1, t(t(d1)*v1))
colnames(result) <- c(colnames(d1), paste0('pro_', colnames(d1)))
result
which gives the column names you want.
Let's say I have a dataframe of Name and Value. Is there any way to extract BOTH the minimum and maximum values within each Name in a single function?
set.seed(1)
df <- tibble(Name = rep(LETTERS[1:3], each = 3), Value = sample(1:100, 9))
# A tibble: 9 x 2
Name Value
<chr> <int>
1 A 27
2 A 37
3 A 57
4 B 89
5 B 20
6 B 86
7 C 97
8 C 62
9 C 58
The output should contain TWO columns only (Name and Value).
Thanks in advance!
You can use range to get max and min value and use it in summarise to get different rows for each Name.
library(dplyr)
df %>%
group_by(Name) %>%
summarise(Value = range(Value), .groups = "drop")
# Name Value
# <chr> <int>
#1 A 27
#2 A 57
#3 B 20
#4 B 89
#5 C 58
#6 C 97
If you have a large dataset, using data.table might be faster.
library(data.table)
setDT(df)[, .(Value = range(Value)), Name]
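In dplyr 1.1.0 and later, returning more than one row per group from summarise is deprecated in favour of reframe. A minimal sketch of the same idea, assuming dplyr >= 1.1.0 and using the values printed in the question so that it does not depend on the random seed:

```r
library(dplyr)

# Values copied from the question's printed tibble
df <- tibble(Name = rep(c("A", "B", "C"), each = 3),
             Value = c(27L, 37L, 57L, 89L, 20L, 86L, 97L, 62L, 58L))

# reframe() may return any number of rows per group and always
# returns an ungrouped result, so .groups = "drop" is not needed
res <- df %>%
  group_by(Name) %>%
  reframe(Value = range(Value))
res
```

Each Name contributes two rows, the minimum first and the maximum second, because range returns c(min, max).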
You can use dplyr::group_by() and dplyr::summarise() like this:
library(dplyr)
set.seed(1)
df <- tibble(Name = rep(LETTERS[1:3], each = 3), Value = sample(1:100, 9))
df %>%
group_by(Name) %>%
summarise(
maximum = max(Value),
minimum = min(Value)
)
This outputs:
# A tibble: 3 × 3
Name maximum minimum
<chr> <int> <int>
1 A 68 1
2 B 87 34
3 C 82 14
What's a little odd is that my original df object looks a little different than yours, in spite of the seed:
# A tibble: 9 × 2
Name Value
<chr> <int>
1 A 68
2 A 39
3 A 1
4 B 34
5 B 87
6 B 43
7 C 14
8 C 82
9 C 59
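A likely explanation for the difference is the R version: R 3.6.0 changed the default algorithm used by sample() from "Rounding" to "Rejection", so the same seed gives different draws before and after that release. The sketch below assumes R >= 3.6.0 with default RNG settings and switches samplers with RNGkind to compare the two:

```r
# Draw with the default sampler of R >= 3.6.0 ("Rejection")
set.seed(1)
new_draw <- sample(1:100, 9)

# Switch to the pre-3.6.0 sampler; R warns that it is non-uniform,
# so the warning is suppressed here
suppressWarnings(RNGkind(sample.kind = "Rounding"))
set.seed(1)
old_draw <- sample(1:100, 9)

# Restore the current default sampler
RNGkind(sample.kind = "Rejection")

new_draw
old_draw
```

If this is the cause, new_draw should match the values shown in this answer and old_draw those in the question.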
I'm currently using rbind() together with slice_min() and slice_max(), but I think it may not be the best way or the most efficient way when the dataframe contains millions of rows.
library(tidyverse)
rbind(df %>% group_by(Name) %>% slice_max(Value),
df %>% group_by(Name) %>% slice_min(Value)) %>%
arrange(Name)
# A tibble: 6 x 2
# Groups: Name [3]
Name Value
<chr> <int>
1 A 57
2 A 27
3 B 89
4 B 20
5 C 97
6 C 58
In base R, the output format can be created with tapply/stack: do a grouped tapply to get the output as a named list of ranges, stack it into a two-column data.frame, and change the column names if needed.
setNames(stack(with(df, tapply(Value, Name, FUN = range)))[2:1], names(df))
Name Value
1 A 27
2 A 57
3 B 20
4 B 89
5 C 58
6 C 97
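The one-liner above can be unpacked into its intermediate steps. This sketch uses the values printed in the question; the object names ranges, stacked and res are only for illustration:

```r
df <- data.frame(Name = rep(c("A", "B", "C"), each = 3),
                 Value = c(27L, 37L, 57L, 89L, 20L, 86L, 97L, 62L, 58L))

# Step 1: one c(min, max) vector per Name, returned as a named list
ranges <- with(df, tapply(Value, Name, FUN = range))

# Step 2: stack() flattens the list into two columns, values and ind
stacked <- stack(ranges)

# Step 3: put the group column first and reuse the original names
res <- setNames(stacked[2:1], names(df))
res
```

Note that stack returns the group column as a factor, so Name is a factor in res.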
Using aggregate.
aggregate(Value ~ Name, df, range)
# Name Value.1 Value.2
# 1 A 1 68
# 2 B 34 87
# 3 C 14 82
Let's say I have a data frame. I would like to mutate new columns by subtracting pairs of existing columns. The pairing follows a rule. For example, in the code below, the first component of each subtraction always has the prefix base_g00 and the second always has the prefix allow_m00. The first component's id runs from 27 to 43 and the second component's id runs from 20 to 36, which is the first id minus 7. Can I write the following code with an apply function or a loop inside mutate to make it simpler? Thanks so much for any suggestions in advance!
pred_error<-y07_13%>%mutate(annual_util_1=base_g0027-allow_m0020,
annual_util_2=base_g0028-allow_m0021,
annual_util_3=base_g0029-allow_m0022,
annual_util_4=base_g0030-allow_m0023,
annual_util_5=base_g0031-allow_m0024,
annual_util_6=base_g0032-allow_m0025,
annual_util_7=base_g0033-allow_m0026,
annual_util_8=base_g0034-allow_m0027,
annual_util_9=base_g0035-allow_m0028,
annual_util_10=base_g0036-allow_m0029,
annual_util_11=base_g0037-allow_m0030,
annual_util_12=base_g0038-allow_m0031,
annual_util_13=base_g0039-allow_m0032,
annual_util_14=base_g0040-allow_m0033,
annual_util_15=base_g0041-allow_m0034,
annual_util_16=base_g0042-allow_m0035,
annual_util_17=base_g0043-allow_m0036)
I think a more idiomatic tidyverse approach would be to reshape your data so those column groups are encoded as a variable instead of as separate columns which have the same semantic meaning.
For instance,
library(dplyr); library(tidyr); library(stringr)
y07_13 <- tibble(allow_m0021 = 1:5,
allow_m0022 = 2:6,
allow_m0023 = 11:15,
base_g0028 = 5,
base_g0029 = 3:7,
base_g0030 = 100)
y07_13 %>%
mutate(row = row_number()) %>%
pivot_longer(-row) %>%
mutate(type = str_extract(name, "allow_m|base_g"),
num = str_remove(name, type) %>% as.numeric(),
group = num - if_else(type == "allow_m", 20, 27)) %>%
select(row, type, group, value) %>%
pivot_wider(names_from = type, values_from = value) %>%
mutate(annual_util = base_g - allow_m)
Result
# A tibble: 15 x 5
row group allow_m base_g annual_util
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 5 4
2 1 2 2 3 1
3 1 3 11 100 89
4 2 1 2 5 3
5 2 2 3 4 1
6 2 3 12 100 88
7 3 1 3 5 2
8 3 2 4 5 1
9 3 3 13 100 87
10 4 1 4 5 1
11 4 2 5 6 1
12 4 3 14 100 86
13 5 1 5 5 0
14 5 2 6 7 1
15 5 3 15 100 85
Here is a vectorised base R approach:
base_cols <- paste0("base_g00", 27:43)
allow_cols <- paste0("allow_m00", 20:36)
new_cols <- paste0("annual_util_", 1:17)
y07_13[new_cols] <- y07_13[base_cols] - y07_13[allow_cols]
y07_13
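The same approach can derive both sets of names from a single id vector, which makes the "first id minus 7" rule explicit. This is a sketch on the small toy data from the reshaping answer above (ids 28 to 30 instead of 27 to 43); the variable ids is introduced here for illustration:

```r
# Toy data with base ids 28:30 and the matching allow ids 21:23
y07_13 <- data.frame(allow_m0021 = 1:5,
                     allow_m0022 = 2:6,
                     allow_m0023 = 11:15,
                     base_g0028 = 5,
                     base_g0029 = 3:7,
                     base_g0030 = 100)

ids <- 28:30                                  # ids of the base_g columns
base_cols  <- sprintf("base_g%04d", ids)      # e.g. "base_g0028"
allow_cols <- sprintf("allow_m%04d", ids - 7) # matching id is base id - 7
new_cols   <- paste0("annual_util_", ids - 27) # numbering starts at id 27

# Subtract the two blocks column by column and attach the results
y07_13[new_cols] <- y07_13[base_cols] - y07_13[allow_cols]
y07_13
```

For the full data in the question, setting ids <- 27:43 would generate all 17 columns.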
An example data.frame:
library(tidyverse)
example <- data.frame(matrix(sample.int(15),5,3),
sample(c("A","B","C"),5,replace=TRUE) ) %>%
`colnames<-`( c("A","B","C","choose") ) %>% print()
Output:
A B C choose
1 9 12 4 A
2 7 8 13 C
3 5 1 2 A
4 15 3 11 C
5 14 6 10 B
The column "choose" indicates which value should be selected from the columns A,B,C
My humble solution for the column "result" :
cols <- c(A=1,B=2,C=3)
col_index <- cols[example$choose]
xy <- cbind(1:nrow(example),col_index)
example %>% mutate(result = example[xy])
Output:
A B C choose result
1 9 12 4 A 9
2 7 8 13 C 13
3 5 1 2 A 5
4 15 3 11 C 11
5 14 6 10 B 6
I'm sure there is a more elegant solution with dplyr,
but my attempts with "rowwise" or "across" failed.
Is it possible to get a one-line solution here?
The efficient option is to make use of row/column indexing
example$result <- example[1:3][cbind(seq_len(nrow(example)),
match(example$choose, names(example)))]
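Row/column indexing works by passing a two-column matrix of (row, column) pairs to `[`, which then returns one element per pair. A minimal sketch with the values printed in the question (fixed here so the result is reproducible; value_cols and idx are names chosen for illustration):

```r
example <- data.frame(A = c(9, 7, 5, 15, 14),
                      B = c(12, 8, 1, 3, 6),
                      C = c(4, 13, 2, 11, 10),
                      choose = c("A", "C", "A", "C", "B"))

value_cols <- c("A", "B", "C")

# One (row, column) pair per row of the data frame
idx <- cbind(seq_len(nrow(example)),
             match(example$choose, value_cols))

# Index the numeric block as a matrix with those pairs
example$result <- as.matrix(example[value_cols])[idx]
example$result
#> [1]  9 13  5 11  6
```

Because the lookup is a single vectorised indexing operation, it avoids the per-row overhead of rowwise.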
with dplyr, we may use get with rowwise
library(dplyr)
example %>%
rowwise %>%
mutate(result = get(choose)) %>%
ungroup
Or instead of get use cur_data()
example %>%
rowwise %>%
mutate(result = cur_data()[[choose]]) %>%
ungroup
Or the vectorized option with row/column indexing
example %>%
mutate(result = select(., where(is.numeric))[cbind(row_number(),
match(choose, names(example)))])
Here is an alternative way:
library(dplyr)
library(tidyr)
example %>%
pivot_longer(-choose) %>%
filter(choose == name) %>%
select(result=value) %>%
bind_cols(example)
result A B C choose
<int> <int> <int> <int> <chr>
1 9 6 9 1 B
2 14 5 2 14 C
3 7 8 7 3 B
4 15 15 4 12 A
5 11 13 10 11 C
I am trying to figure out how to sum values belonging to category a and b by factor file, but also keep the original data.
library(dplyr)
df <- data.frame(ID = 1:20, values = runif(20), category = rep(letters[1:5], 4), file = as.factor(sort(rep(1:5, 4))))
ID values category file
1 1 0.65699229 a 1
2 2 0.70506478 b 1
3 3 0.45774178 c 1
4 4 0.71911225 d 1
5 5 0.93467225 e 1
6 6 0.25542882 a 2
7 7 0.46229282 b 2
8 8 0.94001452 c 2
9 9 0.97822643 d 2
10 10 0.11748736 e 2
11 11 0.47499708 a 3
12 12 0.56033275 b 3
13 13 0.90403139 c 3
14 14 0.13871017 d 3
15 15 0.98889173 e 3
16 16 0.94666823 a 4
17 17 0.08243756 b 4
18 18 0.51421178 c 4
19 19 0.39020347 d 4
20 20 0.90573813 e 4
so that
df[1,2] will be added to df[2,2] to category 'ab' for file 1
df[6,2] will be added to df[7,2] to category 'ab' for file 2
etc.
So far I have this:
df %>%
filter(category %in% c('a' , 'b')) %>%
group_by(file) %>%
summarise(values = sum(values))
Problem
I would like to change the category of the summed values to "ab" and append it to the original data frame in the same pipeline.
Desired output:
ID values category file
1 1 0.65699229 a 1
2 2 0.70506478 b 1
3 3 0.45774178 c 1
4 4 0.71911225 d 1
5 5 0.93467225 e 1
6 6 0.25542882 a 2
7 7 0.46229282 b 2
8 8 0.94001452 c 2
9 9 0.97822643 d 2
10 10 0.11748736 e 2
11 11 0.47499708 a 3
12 12 0.56033275 b 3
13 13 0.90403139 c 3
14 14 0.13871017 d 3
15 15 0.98889173 e 3
16 16 0.94666823 a 4
17 17 0.08243756 b 4
18 18 0.51421178 c 4
19 19 0.39020347 d 4
20 20 0.90573813 e 4
21 21 1.25486225 ab 1
22 22 1.87216325 ab 2
23 23 1.36548126 ab 3
This will get you the result
df %>% bind_rows(
df %>%
filter(category %in% c('a' , 'b')) %>%
group_by(file) %>%
mutate(values = sum(values), category = paste0(category,collapse='')) %>%
filter(row_number() == 1 & n() > 1)
) %>% mutate(ID = row_number())
BTW, the code to produce the data frame in the example is this one:
df <- data.frame(ID = 1:20, values = runif(20), category = rep(letters[1:5], 4), file = as.factor(sort(rep(1:4, 5))))
Now let's say you want to sum multiple columns. You need to provide their names in a vector:
cols = c("values") # columns to be summed
df %>% bind_rows(
df %>%
filter(category %in% c('a' , 'b')) %>%
group_by(file) %>%
mutate_at(vars(cols), sum) %>%
mutate(category = paste0(category,collapse='')) %>%
filter(row_number() == 1 & n() > 1)
) %>% mutate(ID = row_number())
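mutate_at is superseded in current dplyr. A sketch of the same idea with across and all_of follows, assuming dplyr >= 1.0.0. It uses summarise instead of mutate plus filter, hard-codes the "ab" label, and borrows the n_distinct check from another answer here, so it is a variation and not a drop-in copy of the code above:

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only to make the sketch reproducible
df <- data.frame(ID = 1:20, values = runif(20),
                 category = rep(letters[1:5], 4),
                 file = as.factor(sort(rep(1:4, 5))))

cols <- c("values")  # columns to be summed

ab_rows <- df %>%
  filter(category %in% c("a", "b")) %>%
  group_by(file) %>%
  filter(n_distinct(category) > 1) %>%   # keep files that have both a and b
  summarise(across(all_of(cols), sum),   # sum every column listed in cols
            category = "ab",
            .groups = "drop")

# Append the summed rows and renumber ID
result <- bind_rows(df, ab_rows) %>%
  mutate(ID = row_number())
tail(result, 4)
```

With this data every file contains both categories, so four "ab" rows are appended.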
library(dplyr)
df1 %>%
filter(category %in% c('a' , 'b')) %>%
group_by(file) %>%
filter(n_distinct(category) > 1) %>%
summarise(values = sum(values)) %>%
mutate(category="ab",
ID=max(df1$ID)+1:n()) %>%
bind_rows(df1, .)
#> Warning in bind_rows_(x, .id): binding factor and character vector,
#> coercing into character vector
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
#> ID values category file
#> 1 1 0.62585921 a 1
#> 2 2 0.61865851 b 1
#> 3 3 0.05274456 c 1
#> 4 4 0.68156961 d 1
.
.
.
#> 19 19 0.43239411 d 5
#> 20 20 0.85886314 e 5
#> 21 21 1.24451773 ab 1
#> 22 22 0.99001810 ab 2
#> 23 23 1.25331943 ab 3
This data.table approach uses a self-join to get all of the possible two-character combinations.
library(data.table)
setDT(df)
df_self_join <- df[df, on = .(file), allow.cartesian = TRUE
][category != i.category,
.(category = paste0(i.category, category), values = values + i.values, file)
][order(category), .(ID = .I + nrow(df), values, category, file)]
rbindlist(list(df, df_self_join))
ID values category file
1: 1 0.76984382 a 1
2: 2 0.54311583 b 1
3: 3 0.23462016 c 1
4: 4 0.60179043 d 1
...
20: 20 0.03534223 e 5
21: 21 1.31295965 ab 1
22: 22 0.51666175 ab 2
23: 23 1.02305754 ab 3
24: 24 1.00446399 ac 1
25: 25 0.96910373 ac 2
26: 26 0.87795389 ac 4
#total of 80 rows
Here is a pretty close dplyr translation:
library(dplyr)
tib <- as_tibble(df)
inner_join(tib, tib, by = 'file')%>%
filter(ID.x != ID.y)%>%
transmute(category = paste0(category.x, category.y)
, values = values.x + values.y
, file)%>%
arrange(category)%>%
bind_rows(tib, .)%>%
mutate(ID = row_number())%>%
filter(category == 'ab') #filter added to show the "ab" files
# A tibble: 3 x 4
ID values category file
<int> <dbl> <chr> <fct>
1 21 1.31 ab 1
2 22 0.517 ab 2
3 23 1.02 ab 3