using mutate_at from dplyr - r

I have a data frame with 5 columns and I want to produce 4 additional columns giving me the difference between the last 4 columns and the first column.
I tried the following, but that doesn't work:
library(tidyverse)
df <- as_tibble(data.frame(A = c(1,2), B = c(3,4), C = c(4,5), D = c(2,3), E = c(4,5)))
r_diff <- function(x, y) {
  z <- y - x
  return(z)
}
vars_to_process <- c("B","C","D","E")
df %>% mutate_at(.cols=vars_to_process, .funs =r_diff(.,df[,1])) %>% head()
Thanks
Renger

Here's the simplest way to do it.
df %>%
  mutate_at(.vars = vars(B:E),
            .funs = list(~ . - A))
The .vars argument lets you specify columns in the same way that you would specify columns in select(), provided you put that specification inside the function vars().
The .funs argument accepts an anonymous function defined on the fly inside a call to list(), and you can reference a column in the dataframe (in this case A) when defining this anonymous function (see this Stack Overflow question). Incidentally, this is why the original attempt fails: .funs = r_diff(., df[,1]) passes the already-evaluated result of r_diff() rather than a function (the . placeholder only has meaning inside funs() or a ~ formula), and the argument is named .vars in current dplyr, not .cols.
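Since vars() accepts the same helpers as select(), the character vector from the question can be plugged in as well; a small sketch (note that one_of() is superseded by all_of() in newer dplyr):
vars_to_process <- c("B", "C", "D", "E")
df %>%
  mutate_at(.vars = vars(one_of(vars_to_process)),
            .funs = list(~ . - A))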
In addition, with the release of dplyr 1.0.0, you can now simply do the following:
df %>%
  mutate(across(B:E, ~ . - A))

Here's a faster solution using base R. The strategy: convert to a matrix, subtract the required columns from column one, and rebuild a data frame. Note this only returns the modified columns: if there are columns not in vars_to_process they won't appear in the output, but you didn't have any of those in your test set, so I'll assume they don't exist.
So, always wrap things in functions whenever possible:
bsr <- function(df, vars_to_process) {
  m <- as.matrix(df)  # one conversion, then plain vectorized arithmetic
  data.frame(
    A = m[, 1],
    m[, 1] - m[, vars_to_process]  # column one minus each required column
  )
}
Make some test data:
> df = data.frame(matrix(runif(5*1000), ncol=5))
> names(df)=LETTERS[1:5]
> dft = as_tibble(df)
> head(dft)
# A tibble: 6 x 5
A B C D E
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.2609174 0.07857624 0.2727817 0.8498004 0.3403234
2 0.3644744 0.95810657 0.8183856 0.2958133 0.4752349
3 0.6042914 0.98793218 0.7547003 0.9596591 0.5354045
4 0.4000441 0.61403331 0.9018804 0.3838347 0.3266855
5 0.6767012 0.11984219 0.9181570 0.5988404 0.6058629
Compare with the tidyverse version:
akr <- function(df, vars_to_process) {
  df %>% mutate_at(vars_to_process, funs(r_diff(., df[[1]])))
}
Check bsr and akr agree:
> head(bsr(dft, vars_to_process))
A B C D E
1 0.2609174 0.1823412 -0.01186432 -0.58888295 -0.07940594
2 0.3644744 -0.5936322 -0.45391119 0.06866108 -0.11076050
3 0.6042914 -0.3836408 -0.15040892 -0.35536765 0.06888696
4 0.4000441 -0.2139892 -0.50183635 0.01620939 0.07335861
> head(akr(dft, vars_to_process))
# A tibble: 6 x 5
A B C D E
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.2609174 0.1823412 -0.01186432 -0.58888295 -0.07940594
2 0.3644744 -0.5936322 -0.45391119 0.06866108 -0.11076050
3 0.6042914 -0.3836408 -0.15040892 -0.35536765 0.06888696
4 0.4000441 -0.2139892 -0.50183635 0.01620939 0.07335861
Okay, except akr returns a tibble, but never mind. Benchmark:
> microbenchmark(bsr(dft, vars_to_process),akr(dft, vars_to_process))
Unit: microseconds
expr min lq mean median uq
bsr(dft, vars_to_process) 362.117 388.7215 488.9309 446.123 521.776
akr(dft, vars_to_process) 8070.391 8365.4230 9853.5239 8673.692 9335.613
The base R version is roughly 20 times faster (comparing medians). I'd also argue that subtracting a column from another set of columns is tidier than applying a mutator function, but as long as you wrap what you're doing in a function it doesn't matter how messy the guts are.
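For what it's worth, the same idea also works without the matrix conversion, because subtracting a data frame from a vector recycles the vector down each column. A sketch of that variant (mine, not part of the original answer), which unlike bsr also keeps any columns outside vars_to_process:
bsr2 <- function(df, vars_to_process) {
  # vector minus data frame: df[[1]] is recycled down each selected column
  df[vars_to_process] <- df[[1]] - df[vars_to_process]
  df
}
head(bsr2(dft, vars_to_process))  # matches bsr() on the selected columns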

We need to subset the column with [[, as [ still returns a data.frame:
df %>%
  mutate_at(vars_to_process, funs(r_diff(., df[[1]])))
# A tibble: 2 x 5
# A B C D E
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 -2 -3 -1 -3
#2 2 -2 -3 -1 -3
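To see why [[ matters here, compare the two subsetting forms on a tibble (a quick illustration added for clarity):
class(df[, 1])   # "tbl_df" "tbl" "data.frame" - still a one-column tibble
class(df[[1]])   # "numeric" - the bare vector that r_diff() needs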

Related

How to define a variable to record the number of processed rows when using R, dplyr and rowwise?

I have a function which needs a long time to run, so I want to know how many rows of my data frame have been processed. Usually we can define a variable in a for loop to handle this easily, but I do not know how to do it in dplyr.
Let's say the code is:
library(tidyverse)
myFUN <- function(x) {
  x + 1
}
a <- tibble(id=c(1:3),x=c(3,5,1))
a1 <- a %>%
  rowwise() %>%
  mutate(y = myFUN(x))
I hope that somewhere in the code I can define a variable i whose value is incremented by 1 every time a row is processed, and have its values printed in the console like:
1
2
3
Can you pass another variable to the function, which would be the row number of the dataframe, and print it inside the function? Something like:
myFUN <- function(x, y) {
  message(y)
  x + 1
}
and then use
library(dplyr)
a %>% mutate(y = purrr::map2_dbl(x, row_number(), myFUN))
#1
#2
#3
# A tibble: 3 x 3
# id x y
# <int> <dbl> <dbl>
#1 1 3 4
#2 2 5 6
#3 3 1 2
If your function is vectorized, you can drop map_dbl and do:
a %>% mutate(y= myFUN(x, seq_len(n())))
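Another option (a sketch of my own, not from the answer above): wrap the function in a closure that carries its own counter, so the progress reporting needs no extra argument:
# each call bumps a counter kept in the enclosing environment
make_counted <- function(f) {
  i <- 0
  function(x) {
    i <<- i + 1   # <<- updates the counter captured by the closure
    message(i)
    f(x)
  }
}
myFUN_counted <- make_counted(function(x) x + 1)
a %>% rowwise() %>% mutate(y = myFUN_counted(x))
# prints 1, 2, 3 as the rows are processed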

unquote string as variable in pipe

I want to remove duplicate rows from a dataframe, for specific columns only. That can be obtained with distinct:
data <- tibble(a = c(1, 1, 2, 2), b = c(3, 3, 3, 4), z = c(5,4,5,5))
filtered_data <- data %>% distinct(a, b, .keep_all = T)
dim(filtered_data)
# [1] 3 3
This is (almost) what I need. Yet, my problem is that the column names I need to use with distinct will change. So I have a string gen that contains the names of the columns to be used in the distinct call; they need to be unquoted to be useful in the pipe. I found suggestions to use as.name() or eval(parse()). This however gives me a different result:
gen <- c("a", "b")
filtered_data <- data %>% distinct(eval(parse(text = gen)), .keep_all = T)
dim(filtered_data)
# [1] 2 4
The eval seems to do something funny with how often the data gets filtered (and it adds an extra column; I could live with that, though...). So, how do I obtain the same result as if I had used a, b, but using a variable instead?
additional information
I actually obtain gen by reading the column names of a dataframe: gen <- colnames(data)[1:2]. The solution suggested by @gymbrane would be perfect if I had a way to transform gen to c(a, b). The whole point is to avoid hardcoding the column names. I tried things like gen <- noquote(gen), which does not give an error in the rm_dup_rows function suggested below, but it does give a different result, showing the same sort of repeated filtering I started with...
fixed
I think I got it working. It might be inelegant, and I'm not sure every step is necessary for the result, but it seems to work by combining the function provided by @gymbrane below with ensym and quos in a for loop, adding to a list in the GlobalEnv (edit: GlobalEnv isn't necessary):
unquote_string <- function(string) {
  out <- list()
  i <- 1
  for (s in string) {
    t <- ensym(s)
    out[i] <- dplyr::quos(!!t)
    i <- i + 1
  }
  return(out)
}
gen_quo <- unquote_string(gen)
filtered_data <- rm_dup_rows(data, gen_quo)
dim(filtered_data)
# [1] 3 3
How about creating a function and using quosures? Perhaps something like this is what you are looking for...
rm_dup_rows <- function(data, ...) {
  vars <- dplyr::quos(...)
  data %>% distinct(!!!vars, .keep_all = T)
}
I believe this returns what you are asking for
rm_dup_rows(data = data, a, b)
# A tibble: 3 x 3
a b z
<dbl> <dbl> <dbl>
1 3 5
2 3 5
2 4 5
rm_dup_rows(data, b, z)
# A tibble: 3 x 3
a b z
<dbl> <dbl> <dbl>
1 3 5
1 3 4
2 4 5
Additional
You could modify rm_dup_rows just slightly and construct your vector with quos. Something like this...
rm_dup_rows <- function(data, vars) {
  data %>% distinct(!!!vars, .keep_all = T)
}
# quos your column name vector
gen <- quos(a,z)
rm_dup_rows(data, gen)
# A tibble: 3 x 3
a b z
<dbl> <dbl> <dbl>
1 3 5
1 3 4
2 3 5
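For completeness, since gen in the question is a character vector rather than bare names: rlang's syms() plus !!! performs the string-to-symbol conversion that the unquote_string() loop above builds by hand. A sketch, assuming rlang is attached (it comes with the tidyverse):
library(rlang)
gen <- c("a", "b")
data %>% distinct(!!!syms(gen), .keep_all = TRUE)
# same result as distinct(a, b, .keep_all = TRUE)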

sapply results with dplyr

In the example below I am trying to determine which value is closest to each of the vals_int, by id. I can solve this problem using sapply() in a manner similar to below, but I am wondering if the sapply() part can be done with another function in dplyr.
I am really just interested in whether the sapply method and output can be reproduced using some function(s) in the dplyr package. I had thought that do() may work but am struggling to determine how.
library(tidyverse)
df <- data_frame(
  id = rep(1:10, 10) %>% sort,
  visit = rep(1:10, 10),
  value = rnorm(100)
)
vals_int <- c(1, 2, 3)
tmp <- sapply(vals_int,
              function(val_i) abs(df$value - val_i))
Yes, you can use the rowwise() and do() functions in dplyr to perform the same operation on every row, like so:
df %>% rowwise %>% do(diffs = abs(.$value - vals_int))
This will create a column called diffs in a new tibble which is a list of vectors with length 3. If you coerce the output that do() returns to be a data frame, it will instead create a tibble with three columns, one for each of the values subtracted.
df %>% rowwise %>% do(as.data.frame(t(abs(.$value - vals_int))))
The answer by #qdread does what you are looking for, but the tidyverse is starting to move away from the do() function (if that matters to you, idk). Here is an alternative method using map from the purrr package.
df %>%
  mutate(closest = map(value, function(x) {
    abs(x - vals_int) %>%
      t() %>%
      as.tibble()
  })) %>%
  unnest()
That gives you this:
# A tibble: 100 x 6
id visit value V1 V2 V3
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 1 1 0.91813183 0.08186817 1.081868 2.081868
2 1 2 -1.68556173 2.68556173 3.685562 4.685562
3 1 3 -0.05984289 1.05984289 2.059843 3.059843
4 1 4 0.40128729 0.59871271 1.598713 2.598713
5 1 5 -0.09995526 1.09995526 2.099955 3.099955
6 1 6 0.81802663 0.18197337 1.181973 2.181973
7 1 7 -1.49244225 2.49244225 3.492442 4.492442
8 1 8 -0.74256185 1.74256185 2.742562 3.742562
9 1 9 -0.43943907 1.43943907 2.439439 3.439439
10 1 10 0.54985857 0.45014143 1.450141 2.450141
# ... with 90 more rows
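Since abs() is itself vectorized over the value column, the rowwise machinery can be skipped entirely; here is a sketch of that variant (my own, with assumed column names V1-V3) using map_dfc():
library(tidyverse)
df %>%
  bind_cols(map_dfc(set_names(vals_int, paste0("V", seq_along(vals_int))),
                    ~ abs(df$value - .x)))
# one new column per candidate value, same values as the unnest() output above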

Mutating columns of a data frame based on a predicate function (dplyr::mutate_if)

I would like to use dplyr's mutate_if() function to convert list-columns to data-frame-columns, but run into a puzzling error when I try to do so. I am using dplyr 0.5.0, purrr 0.2.2, R 3.3.0.
The basic setup looks like this: I have a data frame d, some of whose columns are lists:
d <- dplyr::data_frame(
  A = list(
    list(list(x = "a", y = 1), list(x = "b", y = 2)),
    list(list(x = "c", y = 3), list(x = "d", y = 4))
  ),
  B = LETTERS[1:2]
)
I would like to convert the column of lists (in this case, d$A) to a column of data frames using the following function:
tblfy <- function(x) {
  x %>%
    purrr::transpose() %>%
    purrr::simplify_all() %>%
    dplyr::as_data_frame()
}
That is, I would like the list-column d$A to be replaced by the list lapply(d$A, tblfy), which is
[[1]]
# A tibble: 2 x 2
x y
<chr> <dbl>
1 a 1
2 b 2
[[2]]
# A tibble: 2 x 2
x y
<chr> <dbl>
1 c 3
2 d 4
Of course, in this simple case, I could just do a simple reassignment. The point, however, is that I would like to do this programmatically, ideally with dplyr, in a generally applicable way that could deal with any number of list-columns.
Here's where I stumble: When I try to convert the list-columns to data-frame-columns using the following application
d %>% dplyr::mutate_if(is.list, funs(tblfy))
I get an error message that I don't know how to interpret:
Error: Each variable must be named.
Problem variables: 1, 2
Why does mutate_if() fail? How can I properly apply it to get the desired result?
Remark
A commenter has pointed out that the function tblfy() should be vectorized. That is a reasonable suggestion. But — unless I have vectorized incorrectly — that does not seem to get at the root of the problem. Plugging in a vectorized version of tblfy(),
tblfy_vec <- Vectorize(tblfy)
into mutate_if() fails with the error
Error: wrong result size (4), expected 2 or 1
Update
After gaining some experience with purrr, I now find the following approach natural, if somewhat long-winded:
d %>%
  map_if(is.list, ~ map(., ~ map_df(., identity))) %>%
  as_data_frame()
This is more or less identical to #alistaire's solution, below, but uses map_if(), resp. map(), in place of mutate_if(), resp. Vectorize().
The original tblfy function errors out for me (even when its elements are chained directly), so let's rebuild it a bit, adding vectorization as well, which lets us avoid an otherwise-necessary prior rowwise() call:
tblfy <- Vectorize(function(x){x %>% purrr::map_df(identity) %>% list()})
Now we can use mutate_if nicely:
d %>% mutate_if(purrr::is_list, tblfy)
## Source: local data frame [2 x 2]
##
## A B
## <list> <chr>
## 1 <tbl_df [2,2]> A
## 2 <tbl_df [2,2]> B
...and if we unnest to see what's there,
d %>% mutate_if(purrr::is_list, tblfy) %>% tidyr::unnest()
## Source: local data frame [4 x 3]
##
## B x y
## <chr> <chr> <dbl>
## 1 A a 1
## 2 A b 2
## 3 B c 3
## 4 B d 4
A couple notes:
map_df(identity) seems to be more efficient at building a tibble than any of the alternative formulations. I know the identity call seems unnecessary, but most everything else breaks.
I'm not sure how widely useful tblfy will be, as it's somewhat dependent on the structure of the lists in the list column, which can vary enormously. If you have a lot with a similar structure, I suppose it's useful, though.
There may be a way to do this with pmap instead of Vectorize, but I can't get it to work with some cursory tries.
In-place conversion without any copying:
library(data.table)
# setDF() converts each inner list to a data.frame by reference,
# so nothing needs to be reassigned
for (col in d) if (is.list(col)) lapply(col, setDF)
d
#Source: local data frame [2 x 2]
#
# A B
#1 <S3:data.frame> A
#2 <S3:data.frame> B

ACF by group in R

I would like to calculate the acf of a time series grouped by a grouping variable. Specifically, I have a data frame containing a single time series (variable a) and a grouping variable (e.g. weekday, variable b). Here is an example:
data <- data.frame(a=rnorm(1:150), b=rep(rep(1:3, each=5), 10))
Now, I would like to calculate the acf for the different values of the grouping variable. For example, for lag 2 and group 1 I would like to get the correlation between t and t-2 calculated only over time points t with b=1 (the value of b for t-2 does not matter). I know that the function acf can easily calculate the acf but I don't find a way to include the grouping variable.
I could manually calculate the desired correlation but as I have a large data set and a lot of lags and values for the grouping variables, I would hope that there is a more elegant and faster way. Here is the manual calculation for the example above (lag 2, b=1):
sel <- which(data$b==1)
cor(data$a[sel[sel > 2]], data$a[sel[sel>2] - 2])
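The same calculation can be wrapped as a small helper for arbitrary lags and group values (a sketch generalizing the two lines above):
# correlation between a[t] and a[t - lag], using only time points t with g == group_val
acf_by_group <- function(x, g, lag, group_val) {
  sel <- which(g == group_val)
  sel <- sel[sel > lag]   # keep only points that have a t - lag partner
  cor(x[sel], x[sel - lag])
}
acf_by_group(data$a, data$b, lag = 2, group_val = 1)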
If the time series object is a tsibble, the following works for me. Assume the data frame is called df and the variable you are interested in is called var; you can additionally specify the maximum lag.
df %>% group_by(Region) %>% ACF(var, lag_max = 18) %>% autoplot()
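For context: ACF() and autoplot() here come from the feasts package, with tsibble providing the data structure. A minimal sketch of the setup, assuming the data has Region, var, and some time index column (here called date):
library(tsibble)
library(feasts)
library(dplyr)
df_ts <- as_tsibble(df, key = Region, index = date)  # declare key and time index
df_ts %>% group_by(Region) %>% ACF(var, lag_max = 18) %>% autoplot()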
I'm not sure I understand exactly what information you are looking for but if you just want the acf values for multiple groups this should accomplish that. Some people have mentioned creating a tidy solution and this uses dplyr, tidyr, and purrr to do grouped calculations.
library(dplyr)
library(tidyr)
library(purrr)
sample_data <- dplyr::data_frame(
  group = sample(c("a", "b", "c"), size = 100, replace = T),
  value = sample.int(30, size = 100, replace = T)
)
head(sample_data)
#> # A tibble: 6 × 2
#> group value
#> <chr> <int>
#> 1 c 28
#> 2 c 9
#> 3 c 13
#> 4 c 11
#> 5 a 9
#> 6 c 9
grouped_acf_values <- sample_data %>%
  tidyr::nest(-group) %>%
  dplyr::mutate(acf_results = purrr::map(data, ~ acf(.x$value, plot = F)),
                acf_values = purrr::map(acf_results, ~ drop(.x$acf))) %>%
  tidyr::unnest(acf_values) %>%
  dplyr::group_by(group) %>%
  dplyr::mutate(lag = seq(0, n() - 1))
head(grouped_acf_values)
#> Source: local data frame [6 x 3]
#> Groups: group [1]
#>
#> group acf_values lag
#> <chr> <dbl> <int>
#> 1 c 1.00000000 0
#> 2 c -0.20192774 1
#> 3 c 0.07191805 2
#> 4 c -0.18440489 3
#> 5 c -0.31817935 4
#> 6 c 0.06368096 5
You can have a look at split to separate your data.frame into buckets and then lapply to apply your function to each group. Something like:
groups_data <- split(data, data$b)
groups_acf <- lapply(groups_data, acf, ...)
Then you have to extract the required information from the output list, for instance with sapply(groups_acf, function(acfobject) acfobject$acf).
For grouped computations I would also definitely go with the new ways "à la" Hadley Wickham, with the %>% operator and group_by; studying that is on my todo list...
