I was examining the code below:
library(dplyr)
DF = data.frame('A' = 1:3, 'B' =2:4)
Condition = 'A'
fn1 = function(x) x + 3
fn2 = function(x) x + 5
DF %>% mutate('aa' = 3:5) %>%
{if (Condition == 'A') {
bb = . %>% mutate('A1' = fn1(A), 'B1' = fn1(B))
bb
} else {
bb = . %>% mutate('A1' = fn2(A), 'B1' = fn2(B))
bb
}
}
Basically, I have two similar functions, fn1 and fn2, and I want to use one of them depending on a condition.
The above implementation does not return the data frame; instead it prints the following:
Functional sequence with the following components:
1. mutate(., A1 = fn1(A), B1 = fn1(B))
Use 'functions' to extract the individual functions.
Can you please help me write the pipe sequence properly so that the above code executes?
We could use across within mutate
library(dplyr)
DF %>%
mutate(aa = 3:5, across(c(A, B), ~ if(Condition == 'A') fn1(.)
else fn2(.), .names = "{.col}1"))
-output
A B aa A1 B1
1 1 2 3 4 5
2 2 3 4 5 6
3 3 4 5 6 7
Also, an option is to get the functions in a list and convert the logical vector to numeric index for subsetting
DF %>%
mutate(aa = 3:5,
across(c(A, B), ~ list(fn2, fn1)[[1 + (Condition == 'A')]](.),
.names = "{.col}1"))
-output
A B aa A1 B1
1 1 2 3 4 5
2 2 3 4 5 6
3 3 4 5 6 7
Based on the comments, if we need a custom name for the new columns, create a named vector and replace with str_replace_all
library(stringr)
nm1 <- setNames(c("XXX", "YYY"), names(DF)[1:2])
DF %>%
mutate(aa = 3:5,
across(c(A, B), ~ list(fn2, fn1)[[1 + (Condition == 'A')]](.),
.names = "{str_replace_all(.col, nm1)}"))
A B aa XXX YYY
1 1 2 3 4 5
2 2 3 4 5 6
3 3 4 5 6 7
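As for why the original attempt printed a functional sequence: in magrittr, a pipeline that starts with . (as in . %>% mutate(...)) defines a function rather than applying one, so the braces returned that function instead of a data frame. A minimal sketch that keeps the original structure, assuming the same DF, Condition, fn1 and fn2 as in the question, calls mutate(., ...) directly inside the braces:

```r
library(dplyr)

DF <- data.frame(A = 1:3, B = 2:4)
Condition <- 'A'
fn1 <- function(x) x + 3
fn2 <- function(x) x + 5

res <- DF %>%
  mutate(aa = 3:5) %>%
  {
    # Inside the braces, `.` is the piped data frame, so pass it to mutate()
    # directly instead of starting a new pipeline with `. %>%`.
    if (Condition == 'A') {
      mutate(., A1 = fn1(A), B1 = fn1(B))
    } else {
      mutate(., A1 = fn2(A), B1 = fn2(B))
    }
  }

res
```

Each branch now evaluates to a data frame, so the value of the braces is the mutated data rather than a function.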
Related
I have a R DataFrame that has a structure similar to the following:
df <- data.frame(var1 = c(1, 1), var2 = c(0, 2), var3 = c(3, 0), f1 = c('a', 'b'), f2=c('c', 'd') )
So visually the DataFrame would look like
> df
var1 var2 var3 f1 f2
1 1 0 3 a c
2 1 2 0 b d
What I want to do is the following:
(1) Treat the first C=3 columns as counts for three different classes. (C is the number of classes, given as an input variable.) Add a new column called "class".
(2) For each row, duplicate the last two entries of the row according to the count of each class (separately); and append the class number to the new "class" column.
For example, the output for the above dataset would be
> df_updated
f1 f2 class
1 a c 1
2 a c 3
3 a c 3
4 a c 3
5 b d 1
6 b d 2
7 b d 2
where row (a c) is duplicated 4 times, 1 time with respect to class 1, and 3 times with respect to class 3; row (b d) is duplicated 3 times, 1 time with respect to class 1 and 2 times with respect to class 2.
I tried looking at previous posts on duplicating rows based on counts (e.g. this link), and I could not figure out how to adapt the solutions there to multiple count columns (and also appending another class column).
Also, my actual dataset has many more rows and classes (say 1000 rows and 20 classes), so ideally I want a solution that is as efficient as possible.
I wonder if anyone can help me on this. Thanks in advance.
Here is a tidyverse option. We can use uncount from tidyr to duplicate the rows according to the count in value (i.e., from the var columns) after pivoting to long format.
library(tidyverse)
df %>%
pivot_longer(starts_with("var"), names_to = "class") %>%
filter(value != 0) %>%
uncount(value) %>%
mutate(class = str_extract(class, "\\d+"))
Output
f1 f2 class
<chr> <chr> <chr>
1 a c 1
2 a c 3
3 a c 3
4 a c 3
5 b d 1
6 b d 2
7 b d 2
Another slight variation is to use expandRows from splitstackshape in conjunction with tidyverse.
library(splitstackshape)
df %>%
pivot_longer(starts_with("var"), names_to = "class") %>%
filter(value != 0) %>%
expandRows("value") %>%
mutate(class = str_extract(class, "\\d+"))
base R
Row order (and row names) notwithstanding:
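tmp <- subset(reshape2::melt(df, id.vars = c("f1","f2"), variable.name = "class", value.name = "n"), n > 0)
tmp$class <- as.integer(sub("var", "", tmp$class)) # class number comes from the column name, not from the count
tmp[rep(seq_len(nrow(tmp)), times = tmp$n), c("f1", "f2", "class")]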
# f1 f2 class
# 1 a c 1
# 2 b d 1
# 4 b d 2
# 4.1 b d 2
# 5 a c 3
# 5.1 a c 3
# 5.2 a c 3
dplyr
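library(dplyr)
library(tidyr) # pivot_longer
df %>%
pivot_longer(-c(f1, f2), names_to = "class", values_to = "n") %>%
dplyr::filter(n > 0) %>%
slice(rep(row_number(), times = n)) %>%
# take the class number from the column name, not from the count
mutate(class = as.numeric(sub("var", "", class))) %>%
select(-n)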
# # A tibble: 7 x 3
# f1 f2 class
# <chr> <chr> <dbl>
# 1 a c 1
# 2 a c 3
# 3 a c 3
# 4 a c 3
# 5 b d 1
# 6 b d 2
# 7 b d 2
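Since the question treats the number of classes C as an input, here is a sketch in plain base R (no reshaping package) that uses C directly. It assumes the count columns are the first C columns of df and that the class number is simply the column position; the helper names row_id, class_id and n are only illustrative.

```r
df <- data.frame(var1 = c(1, 1), var2 = c(0, 2), var3 = c(3, 0),
                 f1 = c('a', 'b'), f2 = c('c', 'd'))
C <- 3  # number of class columns (assumed to be the first C columns)

counts <- as.matrix(df[seq_len(C)])

# Walk the counts row by row: for each original row, classes 1..C
row_id   <- rep(seq_len(nrow(df)), each = C)
class_id <- rep(seq_len(C), times = nrow(df))
n        <- as.vector(t(counts))  # counts in the same row-by-row order

df_updated <- data.frame(
  df[rep(row_id, times = n), -seq_len(C), drop = FALSE],  # repeat the non-count columns
  class = rep(class_id, times = n),                       # class number for each copy
  row.names = NULL
)

df_updated
```

Everything here is vectorised with rep, so it should scale to something like 1000 rows and 20 classes without trouble, although that has not been benchmarked here.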
I have a data frame:
df = data.frame(a = c(1,2,3), b = c(4,5,6))
I want to add two columns holding the distance of a and b from the mean of a:
a b diff_a diff_b
1 4 -1 2
2 5 0 3
3 6 1 4
I don't want to write the columns separately in mutate, as that would calculate the mean multiple times (the mean is just an example here; my actual function takes a long time to run). I want to use a single function like
calculates <- function(a, b){
e_a <- mean(a)
return(list(a - e_a, b - e_a))
}
We may need
library(dplyr)
df %>%
mutate(Meana = mean(a), across(a:b,
~ . - Meana, .names = "diff_{.col}"), Meana = NULL)
-output
a b diff_a diff_b
1 1 4 -1 2
2 2 5 0 3
3 3 6 1 4
data
df <- structure(list(a = c(1, 2, 3), b = c(4, 5, 6)),
class = "data.frame", row.names = c(NA,
-3L))
You may return a named list from the function and use cbind to add new columns to the dataframe.
df = data.frame(a = c(1,2,3), b = c(4,5,6))
calculates <- function(a, b){
e_a <- mean(a)
return(list(diff_a = a - e_a, diff_b = b - e_a))
}
cbind(df, calculates(df$a, df$b))
# a b diff_a diff_b
#1 1 4 -1 2
#2 2 5 0 3
#3 3 6 1 4
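If you prefer to stay inside a dplyr pipeline, a possible variant (a sketch, assuming dplyr 1.0.0 or later) is to have the function return a data frame and call it unnamed inside mutate, which adds the returned columns to the result. The name calculates_df is only illustrative.

```r
library(dplyr)

df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

# Hypothetical helper: the expensive value (here the mean of a) is computed
# once and reused for both new columns.
calculates_df <- function(a, b) {
  e_a <- mean(a)
  data.frame(diff_a = a - e_a, diff_b = b - e_a)
}

# An unnamed data frame returned inside mutate() contributes one column
# per column of that data frame.
res <- df %>%
  mutate(calculates_df(a, b))

res
```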
Suppose I have a list of data frames as follows:
df1 <- data.frame(a1 = 1:5, a2 = 1:5, a3 = 1:5)
df2 <- data.frame(a1 = 1:3, a2 = 2:4, a3 = 3:5)
df3 <- data.frame(a1 = 10:20, a2 = 5:15)
l <- list(df1 = df1, df2 = df2, df3 = df3)
What should I do to perform operations (like mutate) on each element of the list, conditional on the element's name?
For instance, how would I proceed if I wanted to add a new column only when dealing with df1 or df3, and delete a column when dealing with df2?
Could map_if deal with that?
PS: Keep in mind that the list would probably have more than 3 datasets, so multiple conditions may be needed.
You can do this sort of operation with imap. Since you want to apply operations based on the names of the list elements, imap is the right tool.
The .f argument in imap takes 2 arguments:
.x, the first argument, which represents the value
.y, the second argument, which represents the name of each element or, if the elements have no names, its position
So in this case the .x values are your 3 data sets and the .y values are their names df1, df2 and df3 (or their positions 1:3 if the list were unnamed).
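To make the roles of .x and .y concrete before applying them to the data frames, here is a tiny sketch on toy lists. The lists and the labels they produce are made up purely for illustration.

```r
library(purrr)

# Named list: .y is the name of each element
scores <- list(first = 10, second = 20)
named_labels <- imap_chr(scores, ~ paste0(.y, " = ", .x))

# Unnamed list: .y falls back to the position of each element
unnamed_labels <- imap_chr(list(10, 20), ~ paste0(.y, " = ", .x))

named_labels
unnamed_labels
```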
library(purrr)
library(dplyr) # for mutate and %>%
l %>%
imap(~ if(.y %in% c("df1", "df3")) {
.x %>%
mutate(a3 = a1 + a2)
} else {
.x <- .x[-3]
.x
})
$df1
a1 a2 a3
1 1 1 2
2 2 2 4
3 3 3 6
4 4 4 8
5 5 5 10
$df2
a1 a2
1 1 2
2 2 3
3 3 4
$df3
a1 a2 a3
1 10 5 15
2 11 6 17
3 12 7 19
4 13 8 21
5 14 9 23
6 15 10 25
7 16 11 27
8 17 12 29
9 18 13 31
10 19 14 33
11 20 15 35
But if you would like to apply a function to each element that meets a certain condition, you can use map_if. For example, suppose we would like to add an a4 column when the number of rows in an element is greater than a certain number. Bear in mind that the .p function should return a single TRUE or FALSE for each element:
# This use case works
l %>%
map_if(~ nrow(.x) > 3, ~ .x %>%
mutate(a4 = a1 + a2))
# But this doesn't, because names(.x) are the column names of each element, so the result is not what you are after
l %>%
map_if(~ names(.x) %in% c("df1", "df3"), ~ .x %>%
mutate(a4 = a1 + a2))
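That said, map_if can still select by element name if .p is given as a logical vector of the same length as the list instead of a predicate function. A minimal sketch, assuming the list l from the question:

```r
library(purrr)
library(dplyr)

l <- list(
  df1 = data.frame(a1 = 1:5, a2 = 1:5, a3 = 1:5),
  df2 = data.frame(a1 = 1:3, a2 = 2:4, a3 = 3:5),
  df3 = data.frame(a1 = 10:20, a2 = 5:15)
)

# One TRUE/FALSE per list element, computed from the names of the list
# itself rather than from the column names of each element.
keep <- names(l) %in% c("df1", "df3")

res <- l %>%
  map_if(keep, ~ .x %>% mutate(a4 = a1 + a2))

names(res$df1)
names(res$df2)
```

Elements where keep is FALSE (here df2) are returned unchanged.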
An equivalent to imap is map2 where the second argument is the names of each element (and not the column names of each element):
l %>%
map2(names(l), ~ if(.y %in% c("df1", "df3")) {
.x %>%
mutate(a3 = a1 + a2)
} else {
.x <- .x[-3]
.x
})
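When the list grows and several different conditions are needed, one option is to keep the operations in a named list and look them up by element name. This is only a sketch of that idea: the lookup table ops and its entries are hypothetical, and elements without an entry are returned unchanged.

```r
library(purrr)
library(dplyr)

l <- list(
  df1 = data.frame(a1 = 1:5, a2 = 1:5, a3 = 1:5),
  df2 = data.frame(a1 = 1:3, a2 = 2:4, a3 = 3:5),
  df3 = data.frame(a1 = 10:20, a2 = 5:15)
)

# Hypothetical lookup table: one function per element name
ops <- list(
  df1 = function(d) mutate(d, a3 = a1 + a2),
  df2 = function(d) select(d, -a3),
  df3 = function(d) mutate(d, a3 = a1 + a2)
)

res <- imap(l, function(d, nm) {
  op <- ops[[nm]]               # NULL when no operation is registered for nm
  if (is.null(op)) d else op(d)
})

names(res$df2)
```

Adding a new rule then means adding one entry to ops rather than another branch to an if/else chain.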
I have a df where one variable is an integer. I'd like to split this column into its individual digits. See my example below
Group Number
A 456
B 3
C 18
To
Group Number Digit1 Digit2 Digit3
A 456 4 5 6
B 3 3 NA NA
C 18 1 8 NA
We can use read.fwf from base R. Find the maximum number of characters (nchar) in the 'Number' column (mx). Read the 'Number' column after converting it to character (as.character), specify the 'widths' as 1 by replicating 1 mx times, and assign the output to new 'Digit' columns in the data
mx <- max(nchar(df1$Number))
df1[paste0("Digit", seq_len(mx))] <- read.fwf(textConnection(
as.character(df1$Number)), widths = rep(1, mx))
-output
df1
# Group Number Digit1 Digit2 Digit3
#1 A 456 4 5 6
#2 B 3 3 NA NA
#3 C 18 1 8 NA
data
df1 <- structure(list(Group = c("A", "B", "C"), Number = c(456L, 3L,
18L)), class = "data.frame", row.names = c(NA, -3L))
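The same fixed-width idea can also be sketched without read.fwf, by extracting one character position at a time with substring. This assumes df1 as defined above; num_chr and digits are illustrative names.

```r
df1 <- data.frame(Group = c("A", "B", "C"), Number = c(456L, 3L, 18L))

num_chr <- as.character(df1$Number)
mx <- max(nchar(num_chr))

# For each position i, take the i-th character of every number.
# Positions past the end of a string give "", which as.integer() turns into NA.
digits <- sapply(seq_len(mx), function(i) as.integer(substring(num_chr, i, i)))

df1[paste0("Digit", seq_len(mx))] <- as.data.frame(digits)

df1
```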
Another base R option (I think @akrun's approach using read.fwf is much simpler)
cbind(
df,
with(
df,
type.convert(
`colnames<-`(do.call(
rbind,
lapply(
strsplit(as.character(Number), ""),
`length<-`, max(nchar(Number))
)
), paste0("Digit", seq(max(nchar(Number))))),
as.is = TRUE
)
)
)
which gives
Group Number Digit1 Digit2 Digit3
1 A 456 4 5 6
2 B 3 3 NA NA
3 C 18 1 8 NA
Using splitstackshape::cSplit
splitstackshape::cSplit(df, 'Number', sep = '', stripWhite = FALSE, drop = FALSE)
# Group Number Number_1 Number_2 Number_3
#1: A 456 4 5 6
#2: B 3 3 NA NA
#3: C 18 1 8 NA
Updated
I realized I could use max to get the character limit in each row and include it directly in my map2 call, which saves a few lines of code. Thanks to @ThomasIsCoding, whose comment accidentally inspired this.
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
df %>%
rowwise() %>%
mutate(map2_dfc(Number, 1:max(nchar(Number)), ~ str_sub(.x, .y, .y))) %>%
unnest(cols = !c(Group, Number)) %>%
rename_with(~ str_replace(., "\\.\\.\\.", "Digit"), .cols = !c(Group, Number)) %>%
mutate(across(!c(Group, Number), as.numeric))
# A tibble: 3 x 5
Group Number Digit1 Digit2 Digit3
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 456 4 5 6
2 B 3 3 NA NA
3 C 18 1 8 NA
Data
df <- tribble(
~Group, ~Number,
"A", 456,
"B", 3,
"C", 18
)
Two base R methods:
no_cols <- max(nchar(as.character(df1$Number)))
# Using `strsplit()`:
cbind(df1, setNames(data.frame(do.call(rbind,
lapply(strsplit(as.character(df1$Number), ""),
function(x) {
length(x) <- no_cols
x
}
)
)
), paste0("Digit", seq_len(no_cols))))
# Using `regmatches()` and `gregexpr()`:
cbind(df1, setNames(data.frame(do.call(rbind,
lapply(regmatches(df1$Number, gregexpr("\\d", df1$Number)),
function(x) {
length(x) <- no_cols
x
}
)
)
), paste0("Digit", seq_len(no_cols))))
I would like to combine a set of data frames into a single data frame by summing columns that have matching variables (instead of appending columns).
For example, given
df1 <- data.frame(A = c(0,0,1,1,1,2,2), B = c(1,2,1,2,3,1,5), x = c(2,3,1,5,3,7,0))
df2 <- data.frame(A = c(0,1,1,2,2,2), B = c(1,1,3,2,4,5), x = c(4,8,4,1,0,3))
df3 <- data.frame(A = c(0,1,2), B = c(5,4,2), x = c(5,3,1))
I want to match by "A" and "B" and sum the values of "x". For this example, I can get the desired result as follows:
library(plyr)
library(dplyr)
# rename columns so that join_all preserves them all:
colnames(df1)[3] <- "x1"
colnames(df2)[3] <- "x2"
colnames(df3)[3] <- "x3"
# join the data frames by matching "A" and "B" values:
res <- join_all(list(df1, df2, df3), by = c("A", "B"), type = "full")
# get the sums and drop superfluous columns:
arrange(res, A, B) %>%
rowwise() %>%
mutate(x = sum(x1, x2, x3, na.rm = TRUE)) %>%
select(A, B, x)
Result:
A B x
<dbl> <dbl> <dbl>
1 0 1 6
2 0 2 3
3 0 5 5
4 1 1 9
5 1 2 5
6 1 3 7
7 1 4 3
8 2 1 7
9 2 2 2
10 2 4 0
11 2 5 3
A more general solution is
library(dplyr)
# function to get the desired result for two data frames:
my_merge <- function(df1, df2)
{
m1 <- merge(df1, df2, by = c("A", "B"), all = TRUE)
m1 <- rowwise(m1) %>%
mutate(x = sum(x.x, x.y, na.rm = TRUE)) %>%
select(A, B, x)
return(m1)
}
l1 <- list(df2, df3) # omit the first data frame
res <- df1 # initial value of the result
for(df in l1) res <- my_merge(res, df) # call the function repeatedly
Is there a more efficient option for combining a large set of data frames? Ideally it should be recursive (i.e. it's better not to join all data frames into one massive data frame before calculating the sums).
An easier option is to bind the rows of the datasets, then group by the columns of interest and get the summarised output by getting the sum of 'x'
library(tidyverse)
bind_rows(df1, df2, df3) %>%
group_by(A, B) %>%
summarise(x = sum(x))
# A tibble: 11 x 3
# Groups: A [?]
# A B x
# <dbl> <dbl> <dbl>
# 1 0 1 6
# 2 0 2 3
# 3 0 5 5
# 4 1 1 9
# 5 1 2 5
# 6 1 3 7
# 7 1 4 3
# 8 2 1 7
# 9 2 2 2
#10 2 4 0
#11 2 5 3
If there are many objects in the global environment with the pattern "df" followed by some digits
mget(ls(pattern= "^df\\d+")) %>%
bind_rows %>%
group_by(A, B) %>%
summarise(x = sum(x))
As the OP mentioned memory constraints, doing the join first and then using rowSums (or + with reduce) would be more efficient
mget(ls(pattern= "^df\\d+")) %>%
reduce(full_join, by = c("A", "B")) %>%
transmute(A, B, x = rowSums(.[3:5], na.rm = TRUE)) %>%
arrange(A, B)
# A B x
#1 0 1 6
#2 0 2 3
#3 0 5 5
#4 1 1 9
#5 1 2 5
#6 1 3 7
#7 1 4 3
#8 2 1 7
#9 2 2 2
#10 2 4 0
#11 2 5 3
This could also be done with data.table
library(data.table)
rbindlist(mget(ls(pattern= "^df\\d+")))[, .(x = sum(x)), by = .(A, B)]
Ideally it should be recursive (i.e. it's better not to join all data frames into one massive data frame before calculating the sums).
If you're memory constrained and willing to sacrifice speed (vs @akrun's data.table approach), use one table at a time in a loop:
library(data.table)
tabs = c("df1", "df2", "df3")
# enumerate all combos for the results table
# initializing sum to 0
res = CJ(A = 0:2, B = 1:5, x = 0)
# loop over tabs, adding on
for (i in seq_along(tabs)){
tab = get(tabs[[i]])
res[tab, on=.(A, B), x := x + i.x][]
rm(tab)
}
If you need to read tables from disk, change tabs to file names and get to fread or whatever function.
I am skeptical that you can fit all the tables in memory, but cannot also fit an rbind-ed copy of them together.
Similarly (thanks to @akrun's comment), use his approach pairwise:
res = data.table(get(tabs[[1]]))[0L]
for (i in seq_along(tabs)){
tab = get(tabs[[i]])
res = rbind(res, tab)[, .(x = sum(x)), by=.(A,B)]
rm(tab)
}
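The same pairwise idea can be sketched in the tidyverse with purrr::reduce, which folds the list one data frame at a time so that only the running total and the next table are combined at each step. The helper name add_tables is illustrative, and no claim is made here about how its speed compares with the data.table versions.

```r
library(dplyr)
library(purrr)

df1 <- data.frame(A = c(0,0,1,1,1,2,2), B = c(1,2,1,2,3,1,5), x = c(2,3,1,5,3,7,0))
df2 <- data.frame(A = c(0,1,1,2,2,2), B = c(1,1,3,2,4,5), x = c(4,8,4,1,0,3))
df3 <- data.frame(A = c(0,1,2), B = c(5,4,2), x = c(5,3,1))

# Hypothetical helper: stack the running result with the next table,
# then collapse duplicated (A, B) pairs by summing x.
add_tables <- function(acc, d) {
  bind_rows(acc, d) %>%
    group_by(A, B) %>%
    summarise(x = sum(x), .groups = "drop")
}

res <- reduce(list(df1, df2, df3), add_tables)

res
```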