Calculating a function of multiple variables for a dataframe in wide format is very familiar:
library(tidyverse)
df <- tibble(t = 1:3, b = 11:13, c = 21:23)
df <- df %>% mutate(d = b + c) # or base R: df$d <- df$b + df$c
What about when the dataframe is in long format? e.g.
df <- df %>% pivot_longer(-t, names_to = "variable", values_to = "value")
In this long format, you could imagine the same operation working by first group_by(t), and then calculating one value of d for each group, namely that group's variable=b value plus that group's variable=c value. Is this possible? One might think of something like summarise(d = b + c) but that expects wide format.
NB my real-world example has more than two cols b and c and I want to put them into a defined function, not just add them. My working solution is pivoting a huge dataframe from long to wide, calling my multivariable function to define a new column, then pivoting back to long.
Edit: to make the real world example explicit, I need to call a defined function that treats its arguments differently, unlike sum. For example
my.func <- function(b, c) { b^c }
How could the variable d be calculated by applying this function to the values of b and c associated with the same value of t?
For a plain sum, we can use sum() instead of +:
library(dplyr)
library(tidyr)
df %>%
group_by(t) %>%
summarise(d = sum(value[variable %in% c('b', 'c')]))
To apply my.func instead, we need to extract the values that correspond to 'b' and 'c' within each group:
df %>%
group_by(t) %>%
mutate(new = my.func(value[variable == 'b'], value[variable == 'c']))
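A self-contained sketch of this second approach, assuming exactly one b row and one c row per value of t. The names df_long, d_long and result are illustrative only, and appending d back as extra long-format rows is one possible design, not something the question specifies:

```r
library(dplyr)
library(tidyr)

my.func <- function(b, c) { b^c }

# Long-format data built the same way as in the question
df_long <- tibble(t = 1:3, b = 11:13, c = 21:23) %>%
  pivot_longer(-t, names_to = "variable", values_to = "value")

# One value of d per t, computed from that group's b and c values
d_long <- df_long %>%
  group_by(t) %>%
  summarise(value = my.func(value[variable == "b"], value[variable == "c"])) %>%
  mutate(variable = "d")

# Append the new variable as additional rows, keeping the long format
result <- bind_rows(df_long, d_long) %>%
  arrange(t, variable)
```

This avoids pivoting the whole data frame to wide format and back, at the cost of one logical subset per argument per group.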
My question seems simple, but I just can't do it. I have a dataframe with multiple columns with the name starting with coa and another column p with values like A, D, F, and so on, which changes according to the id.
All I found is how to do this matching with a fixed value, let's say "A", as below:
df <-df %>%
mutate(ly = any(str_detect(c_across(starts_with("coa")), "A")))
However, in my case, I want to compare to the column p specifically, where p changes, something like this:
df <-df %>%
mutate(ly = any(str_detect(c_across(starts_with("coa")), p)))
In this case, I get the error:
x no applicable method for 'type' applied to an object of class "factor"
Any thoughts? Thanks!
If we need to create a column, use if_any, which compares each coa column with p row by row:
library(dplyr)
library(stringr)
df <- df %>%
mutate(ly = if_any(starts_with("coa"), ~ str_detect(.x, p)))
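The error message in the question suggests that p is stored as a factor, which str_detect does not accept as a pattern. A minimal sketch, assuming made-up example data (the tibble below is hypothetical) and converting p with as.character first:

```r
library(dplyr)
library(stringr)

# Hypothetical data: p is a factor, as the error message suggests
df <- tibble(
  id   = 1:3,
  p    = factor(c("A", "D", "F")),
  coa1 = c("A", "B", "C"),
  coa2 = c("X", "D", "Y")
)

# Compare every coa* column with the row's own value of p
out <- df %>%
  mutate(ly = if_any(starts_with("coa"), ~ str_detect(.x, as.character(p))))
```

Note that str_detect treats the pattern as a regular expression by default, so values of p containing special characters may need wrapping in fixed().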
I think this is a good place to use dplyr::across. You can run vignette('colwise') for a more comprehensive guide, but the key point here is that we can mutate all columns starting with "coa" simultaneously using the function == and we can pass a second argument, p, to == using the ... option provided by across.
library(dplyr)
df <- tibble(p = 1:10, coa1 = 1:10, coa2 = 11:20)
df %>%
mutate(across(.cols = starts_with('coa'), .fns = `==`, p))
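In dplyr 1.1.0 and later, passing extra arguments through the ... of across is deprecated. The same comparison can be sketched with a lambda instead, using the same example data (the name out is illustrative):

```r
library(dplyr)

df <- tibble(p = 1:10, coa1 = 1:10, coa2 = 11:20)

# Replace each coa* column with the result of comparing it to p
out <- df %>%
  mutate(across(starts_with("coa"), ~ .x == p))
```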
Given
base <- data.frame( a = 1)
f <- function() c(2,3,4)
I am looking for a solution that would result in a function f being applied to each row of base data frame and the result would be appended to each row. Neither of the following works:
result <- base %>% rowwise() %>% mutate( c(b,c,d) = f() )
result <- base %>% rowwise() %>% mutate( (b,c,d) = f() )
result <- base %>% rowwise() %>% mutate( b,c,d = f() )
What is the correct syntax for this task?
This appears to be a similar problem (Assign multiple new variables on LHS in a single line in R) but I am specifically interested in solving this with functions from tidyverse.
I think the best you are going to do is a do() to modify the data.frame. Perhaps
base %>% do(cbind(., setNames(as.list(f()), c("b","c","d"))))
It would probably be best if f() returned a list in the first place, with one element per new column.
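With dplyr 1.0.0 or later, mutate also accepts an unnamed expression that returns a data frame and unpacks it into several columns. A minimal sketch under that assumption, where f_df is a hypothetical wrapper around f that names the three values:

```r
library(dplyr)

base <- data.frame(a = 1)
f <- function() c(2, 3, 4)

# Hypothetical wrapper: returns the values of f() as a one-row tibble
f_df <- function() {
  out <- f()
  tibble(b = out[1], c = out[2], d = out[3])
}

# The unnamed tibble returned for each row is spliced in as columns b, c, d
result <- base %>%
  rowwise() %>%
  mutate(f_df()) %>%
  ungroup()
```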
In case you're willing to do this without dplyr:
# starting data frame
base_frame <- data.frame(col_a = 1:10, col_b = 10:19)
# the function you want applied to a given column
add_to <- function(x) { x + 100 }
# run this function on your base data frame, specifying the column you want to apply the function to:
add_computed_col <- function(frame, funct, col_choice) {
frame[paste(floor(runif(1, min=0, max=10000)))] = lapply(frame[col_choice], funct)
return(frame)
}
Usage:
df <- add_computed_col(base_frame, add_to, 'col_a')
head(df)
And add as many columns as needed:
df_b <- add_computed_col(df, add_to, 'col_b')
head(df_b)
Then rename your columns, since each new column is given a random numeric name.
I am trying to use a custom function inside a piped mutate statement. I looked at this somewhat similar SO post, but in vain.
Say I have a data frame like this (where blob is some variable not related to the specific task but is part of the entire data) :
df <-
data.frame(exclude=c('B','B','D'),
B=c(1,0,0),
C=c(3,4,9),
D=c(1,1,0),
blob=c('fd', 'fs', 'sa'),
stringsAsFactors = F)
I have a function that uses the variable names to select some columns based on the value in the exclude column and, for example, calculates a sum over the variables not specified in exclude (which is always a single character).
FUN <- function(df){
sum(df[c('B', 'C', 'D')] [!names(df[c('B', 'C', 'D')]) %in% df['exclude']] )
}
When I give a single row (row 1) to FUN, I get the expected sum of C and D (those not mentioned by exclude), namely 4:
FUN(df[1,])
How do I do the same in a pipe with mutate (adding the result to a variable s)? These two attempts do not work:
df %>% mutate(s=FUN(.))
df %>% group_by(1:n()) %>% mutate(s=FUN(.))
UPDATE
This also does not work as intended:
df %>% rowwise(.) %>% mutate(s=FUN(.))
This works, of course, but is not within dplyr's mutate (and pipes):
df$s <- sapply(1:nrow(df), function(x) FUN(df[x,]))
If you want to use dplyr you can do so using rowwise and your function FUN.
df %>%
rowwise %>%
do({
result = as_tibble(.) # as_data_frame() is deprecated in favour of as_tibble()
result$s = FUN(result)
result
})
The same can be achieved using group_by instead of rowwise (as you already tried), but with do instead of mutate:
df %>%
group_by(1:n()) %>%
do({
result = as_tibble(.) # as_data_frame() is deprecated in favour of as_tibble()
result$s = FUN(result)
result
})
The reason mutate doesn't work in this case is that . refers to the whole tibble, not the current row, so it's like calling FUN(df).
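With a recent dplyr (1.1.0 or later, which is an assumption about your installation), pick(everything()) inside a rowwise mutate returns the current row as a one-row tibble, so FUN can be called per row without do. A sketch using the data and function from the question (the name out is illustrative):

```r
library(dplyr)

df <- data.frame(exclude = c('B', 'B', 'D'),
                 B = c(1, 0, 0),
                 C = c(3, 4, 9),
                 D = c(1, 1, 0),
                 blob = c('fd', 'fs', 'sa'),
                 stringsAsFactors = FALSE)

FUN <- function(df) {
  sum(df[c('B', 'C', 'D')][!names(df[c('B', 'C', 'D')]) %in% df['exclude']])
}

# pick(everything()) is the current one-row slice, unlike `.`
out <- df %>%
  rowwise() %>%
  mutate(s = FUN(pick(everything()))) %>%
  ungroup()
```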
A much more efficient way of doing the same thing though is to just make a matrix of columns to be included and then use rowSums.
cols <- c('B', 'C', 'D')
include_mat <- outer(function(x, y) x != y, X = df$exclude, Y = cols)
# or outer(`!=`, X = df$exclude, Y = cols) if it's more readable to you
df$s <- rowSums(df[cols] * include_mat)
purrr approach
We can use a combination of nest and map_dbl for this:
library(tidyverse)
df %>%
rowwise %>%
nest(-blob) %>%
mutate(s = map_dbl(data, FUN)) %>%
unnest
Let's break that down a little bit. First, rowwise makes each subsequent operation apply to each row individually, which supports arbitrarily complex per-row operations.
Next, nest will create a new column that is a list of our data to be fed into FUN (the beauty of tibbles vs data.frames!). Since we are applying this rowwise, each row contains a single-row tibble of exclude:D.
Finally, we use map_dbl to map our FUN to each of these tibbles. map_dbl is used over the family of other map_* functions since our intended output is numeric (i.e. double).
unnest returns our tibble into the more standard structure.
purrrlyr approach
While purrrlyr may not be as 'popular' as its parents dplyr and purrr, its by_row function has some utility here.
In your above example, we would use your data frame df and user-defined function FUN in the following way:
df %>%
by_row(..f = FUN, .to = "s", .collate = "cols")
That's it! Giving you:
# tibble [3 x 6]
  exclude     B     C     D blob      s
  <chr>   <dbl> <dbl> <dbl> <chr> <dbl>
1 B           1     3     1 fd        4
2 B           0     4     1 fs        5
3 D           0     9     0 sa        9
Admittedly, the syntax is a little strange, but here's how it breaks down:
..f = the function to apply to each row
.to = the name of the output column, in this case s
.collate = the way the results should be collated, by list, row, or column. Since FUN only has a single output, we would be fine to use either "cols" or "rows"
See here for more information on using purrrlyr...
Performance
A word of warning: while I like the functionality of by_row, it is not always the best approach for performance. The purrr approach is more intuitive, but comes at a rather large speed cost. See the following microbenchmark test:
library(microbenchmark)
mbm <- microbenchmark(
purrr.test = df %>% rowwise %>% nest(-blob) %>%
mutate(s = map_dbl(data, FUN)) %>% unnest,
purrrlyr.test = df %>% by_row(..f = FUN, .to = "s", .collate = "cols"),
rowwise.test = df %>%
rowwise %>%
do({
result = as_tibble(.)
result$s = FUN(result)
result
}),
group_by.test = df %>%
group_by(1:n()) %>%
do({
result = as_tibble(.)
result$s = FUN(result)
result
}),
sapply.test = {df$s <- sapply(1:nrow(df), function(x) FUN(df[x,]))},
times = 1000
)
autoplot(mbm)
You can see that the purrrlyr approach is faster than using do with rowwise or group_by(1:n()) (see @konvas's answer), and roughly on par with the sapply approach. However, the package is admittedly not the most intuitive. The standard purrr approach seems to be the slowest, but is perhaps easier to work with. Different user-defined functions may change the speed order.
I am having trouble figuring out how to perform a chisq.test within a nested list column of a data frame. If I need to turn the data list-column into a matrix, how do I do that, and then how do I properly refer to the variables for the chisq.test? Take the example below. Thank you!
Here is an example:
a <- rep(c('A', 'B'), 10)
b <- rep(c('a', 'b'), each = 10)
c <- as.numeric(rep(c(1:10), each = 2))
df <- as.data.frame(cbind(a, b, c)) %>%
mutate(c = as.numeric(c))
Is the distribution of the 'c' counts the same across the levels of factor 'b' ('a' and 'b'), within each subgroup of factor 'a' ('A' and 'B')?
dfnest <- df %>%
nest(-a) %>%
mutate(chisq_p = map_dbl(data, ~chisq.test(.$b~.$c)$p.value))
The last line is what I want to accomplish, but the above is incorrect - how do I use the chisq.test within the list-column data, and insert the p.value into a new column?
Changing the arguments in the call of chisq.test returns the expected result.
df %>%
nest(-a) %>%
mutate(chisq_p = map_dbl(data, ~chisq.test(.)$p.value))
You can also use an anonymous function.
df %>%
nest(-a) %>%
mutate(chisq_p = map_dbl(data, function(f) { chisq.test(f)$p.value }))
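If the intent is instead to cross-tabulate b against c within each level of a (an assumption about the question, not something the answer above states), chisq.test also accepts two vectors and builds the contingency table itself. A hedged sketch using the named-argument form of nest from tidyr 1.0.0 or later:

```r
library(dplyr)
library(tidyr)
library(purrr)

a <- rep(c("A", "B"), 10)
b <- rep(c("a", "b"), each = 10)
c <- as.numeric(rep(1:10, each = 2))
df <- data.frame(a, b, c)

# Nest b and c within each level of a, then test b against c per group.
# suppressWarnings hides the small-count approximation warning for this toy data.
dfnest <- df %>%
  nest(data = c(b, c)) %>%
  mutate(chisq_p = map_dbl(
    data,
    ~ suppressWarnings(chisq.test(.x$b, .x$c))$p.value
  ))
```

With counts this small the chi-squared approximation is unreliable, so the p-values here only illustrate the mechanics.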
Sample df:
df <- data.frame(x = c(runif(10,0,2*pi),runif(10,0,360)), group = gl(n = 2, k = 10, labels =c("A","B")))
I want to modify x only for group A (convert it to degrees). With base I just do:
df <- within(df,x[group == "A"] <- x[group == "A"]*180/pi)
I was wondering if there could be a way to do this with dplyr. This is wrong:
df <- df %>% filter(group == "A") %>% mutate(x = x*180/pi)
Because it returns only the subset of df where group == "A". Is there a (simple) way to do this, or is this a case where base trumps dplyr for ease of use?
We can use ifelse to create the logical condition, and based on that we either do the arithmetic calculation or else return the original values.
df %>%
mutate(x = ifelse(group=="A", x*180/pi, x))
Or, as @AlexIoannides mentioned, if_else from dplyr can be used; it is stricter than ifelse and checks that both branches have compatible types.
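A minimal sketch of the if_else variant on the sample data (set.seed is added here only to make the example reproducible, and out is an illustrative name):

```r
library(dplyr)

set.seed(1)
df <- data.frame(x = c(runif(10, 0, 2 * pi), runif(10, 0, 360)),
                 group = gl(n = 2, k = 10, labels = c("A", "B")))

# Convert x to degrees for group A only; group B rows are left unchanged
out <- df %>%
  mutate(x = if_else(group == "A", x * 180 / pi, x))
```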
In data.table, this can be done by assignment in place and should be more efficient.
library(data.table)
setDT(df)[group=="A", x := x*180/pi]