subtract mean from every element dplyr - r

I want to demean all my columns using dplyr. I tried but failed using the "do()" command.
I basically want to replicate the following using easier dplyr commands:
tickers <- c(rep(1,10),rep(2,10))
df <- data.frame(cbind(tickers,rep(1:20),rep(2:21)))
colnames(df) <- c("tickers","col1","col2")
df %>% group_by(tickers)
apply(df[,2:3],2,function(x) x - mean(x))
I am sure this can be done much better using dplyr.
Thanks!

If we are using dplyr, we can do this with mutate_each and use any of the select helpers mentioned in ?select to match the columns. Here, I am using matches, which takes a regular expression as its pattern.
library(dplyr)
df %>%
  mutate_each(funs(. - mean(.)), matches('^col')) %>%
  select(-tickers)
But this can also be done using base R:
df[2:3]-colMeans(df[2:3])[col(df[2:3])]
The colMeans output is a vector; indexing it with col() replicates each column mean so the dimensions match the data.
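Note that mutate_each() and funs() have since been deprecated; in current dplyr (>= 1.0) the same demeaning can be written with across():
library(dplyr)
df %>%
  mutate(across(matches('^col'), ~ .x - mean(.x))) %>%
  select(-tickers)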

Related

Product of columns selected by starts_with()

I am wondering if there is an efficient or alternative way to compute the row-wise product of a selection of columns in dplyr format.
I know one way to do it (see below), but rowwise() takes a long time to run on my large data set, so I am looking for an alternative.
df = df %>%
  rowwise() %>%
  mutate(myprod = prod(c_across(starts_with('var_xyz'))))
Here are some alternative options.
If you want to stay in the tidyverse, you can try pmap_dbl:
library(dplyr)
library(purrr)
df %>% mutate(myprod = pmap_dbl(select(., starts_with('var_xyz')), prod))
There are also a base R option with Reduce and a matrixStats option using rowProds:
cols <- grep('^var_xyz', names(df))

# 2. base R
df$myprod <- Reduce(`*`, df[cols])

# 3. matrixStats
df$myprod <- matrixStats::rowProds(as.matrix(df[cols]))
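As a quick sanity check on a toy data frame (hypothetical columns var_xyz1 and var_xyz2), the approaches agree:
df <- data.frame(var_xyz1 = 1:3, var_xyz2 = 4:6)
df %>% mutate(myprod = pmap_dbl(select(., starts_with('var_xyz')), prod))
Reduce(`*`, df[grep('^var_xyz', names(df))])
# myprod is 4, 10, 18 either way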

dplyr - convert column names containing words to character

I want to convert column names that start with the word "feature" to character type using dplyr. I tried the below and a few other variations using answers from stackoverflow. Any help would be appreciated. Thanks!
train %>% mutate_if(vars(starts_with("feature")), funs(as.character(.)))
I am trying to improve my usage of dplyr commands.
You need mutate_at instead:
library(dplyr)
train %>% mutate_at(vars(starts_with("feature")), as.character)
As @Gregor mentioned, mutate_if is for when the selection of columns is based on the actual data in the columns, not on their names.
For example,
iris %>% mutate_if(is.numeric, sqrt)
So the square root is calculated only if the data in the column is numeric.
If we want to combine multiple vars() selections into one, we can use matches:
merchants %>% mutate_at(vars(matches("_id|category_")), as.character)
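Note that mutate_at(), mutate_if(), and vars() are superseded in current dplyr; assuming dplyr >= 1.0, the same conversions can be written with across() (and where() for predicate-based selection):
train %>% mutate(across(starts_with("feature"), as.character))
merchants %>% mutate(across(matches("_id|category_"), as.character))
iris %>% mutate(across(where(is.numeric), sqrt))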

dplyr mutate using dynamic variable name while respecting group_by

I'm trying, as per "dplyr mutate using variable columns" and "dplyr - mutate: use dynamic variable names", to use dynamic names in mutate. What I am trying to do is normalize column data by groups, subject to a minimum standard deviation. Each column has a different minimum standard deviation.
e.g. (I omitted loops & map statements for convenience)
require(dplyr)
require(magrittr)
data(iris)
iris <- tbl_df(iris)
minsd <- c('Sepal.Length' = 0.8)
varname <- 'Sepal.Length'
iris %>%
  group_by(Species) %>%
  mutate(!!varname := mean(pluck(iris, varname), na.rm = T) /
           max(sd(pluck(iris, varname)), minsd[varname]))
I got the dynamic assignment and variable selection to work as suggested by the referenced answers, but group_by() is not respected, which, for me at least, is the main benefit of using dplyr here.
The desired answer is given by:
iris %>%
  group_by(Species) %>%
  mutate(!!varname := mean(Sepal.Length, na.rm = T) /
           max(sd(Sepal.Length), minsd[varname]))
Is there a way around this?
I do not know pluck well enough to say what went wrong, but I would go for the following, which works:
iris %>%
  group_by(Species) %>%
  mutate(
    !!varname :=
      mean(!!as.name(varname), na.rm = T) /
      max(sd(!!as.name(varname)),
          minsd[varname])
  )
Let me know if this isn't what you were looking for.
The other answer is obviously the best, and it also solved a similar problem that I have encountered: with !!as.name() there is no need to use group_by_() (or group_by_at()) or arrange_() (or arrange_at()).
However, another way is to replace pluck(iris, varname) in your code with .data[[varname]]. The reason pluck(iris, varname) does not work is, I suppose, that iris inside pluck(iris, varname) is not grouped, whereas .data refers to the tibble on which mutate() executes, and so is grouped.
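Concretely, the same pipeline with .data[[varname]] substituted for pluck(iris, varname) (a sketch using the setup from the question):
iris %>%
  group_by(Species) %>%
  mutate(!!varname := mean(.data[[varname]], na.rm = T) /
           max(sd(.data[[varname]]), minsd[varname]))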
An alternative to as.name() is rlang::sym() from the rlang package.
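For example, the mutate() call above could equally be written as:
mutate(!!varname := mean(!!rlang::sym(varname), na.rm = T) /
         max(sd(!!rlang::sym(varname)), minsd[varname]))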

pass grouped dataframe to own function in dplyr

I am trying to move from plyr to dplyr. However, I still can't figure out how to call my own functions in a chained dplyr pipeline.
I have a data frame with a factorised ID variable and an order variable. I want to split the frame by the ID, order it by the order variable and add a sequence in a new column.
My plyr functions looks like this:
f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x)-1))
data <- ddply(data, .(ID_variable), f)
In dplyr I thought this should look something like this:
f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x)-1))
data <- data %>% group_by(ID_variable) %>% f
Can anyone tell me how to modify my dplyr call to successfully pass my own function and get the same functionality my plyr function provides?
EDIT: If I use the dplyr formula as described here, it DOES pass an object to f. However, while plyr passes a number of separate tables (split by the ID variable), dplyr does not pass one table per group but the ENTIRE table (as some kind of dplyr object in which the groups are merely annotated). Thus, when I cbind the Experience variable, it appends a counter running from 0 to the length of the entire table instead of restarting for each group.
I have found a way to get the same functionality in dplyr using this approach:
data <- data %>%
group_by(ID_variable) %>%
arrange(ID_variable,order_variable) %>%
mutate(Experience = 0:(n()-1))
However, I would still be keen to learn how to pass the groups, split into separate tables, to my own functions in dplyr.
For those who get here from Google: let's say you wrote your own print function.
printFunction <- function(dat) print(dat)
df <- data.frame(a = 1:6, b = 1:2)
As it was asked here,
df %>%
  group_by(b) %>%
  printFunction(.)
prints the entire data frame. To get dplyr to call the function once per group, you should use do():
df %>%
  group_by(b) %>%
  do(printFunction(.))
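Note that do() is superseded in current dplyr; assuming dplyr >= 0.8.1, the same per-group behaviour is available from group_walk() (for side effects) and group_modify() (for functions that, like the original f, return a data frame per group):
df %>%
  group_by(b) %>%
  group_walk(~ printFunction(.x))

# and for the original question; the grouping column is re-attached automatically
data <- data %>%
  group_by(ID_variable) %>%
  group_modify(~ f(.x))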

Split up a dataframe by number of rows

I have a dataframe made up of 400'000 rows and about 50 columns. As this dataframe is so large, it is too computationally taxing to work with.
I would like to split this dataframe up into smaller ones, after which I will run the functions I would like to run, and then reassemble the dataframe at the end.
There is no grouping variable that I would like to use to split up this dataframe. I would just like to split it up by number of rows. For example, I would like to split this 400'000-row table into 400 1'000-row dataframes.
How might I do this?
Make your own grouping variable.
d <- split(my_data_frame,rep(1:400,each=1000))
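Equivalently (assuming the row count is exactly 400 * 1000), base R's gl() builds the same grouping factor, 400 levels each repeated 1000 times:
d <- split(my_data_frame, gl(400, 1000))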
You should also consider the ddply function from the plyr package, or the group_by() function from dplyr.
If you don't know how many rows are in the data frame, or if its length might not be an exact multiple of your desired chunk size, you can do:
chunk <- 1000
n <- nrow(my_data_frame)
r <- rep(1:ceiling(n/chunk),each=chunk)[1:n]
d <- split(my_data_frame,r)
You could also use
r <- ggplot2::cut_width(1:n,chunk,boundary=0)
For future readers, methods based on the dplyr and data.table packages will probably be (much) faster for doing group-wise operations on data frames, e.g. something like
(my_data_frame
  %>% mutate(index = rep(1:ngrps, each = full_number, length.out = n()))
  %>% group_by(index)
  ## ... followed by mutate(), summarise(), or do()
)
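For completeness, a rough data.table sketch of the same idea (ngrps and full_number as above; some_chunk_fun() is a hypothetical per-chunk function):
library(data.table)
dt <- as.data.table(my_data_frame)
dt[, index := rep(1:ngrps, each = full_number, length.out = .N)]
result <- dt[, some_chunk_fun(.SD), by = index]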
There are also many answers here
I had a similar question and used this:
library(tidyverse)
n = 100 # chunk size: number of rows per group
split <- df %>% group_by(row_number() %/% n) %>% group_map(~ .x)
from left to right:
you assign your result to split
you start with df as your input dataframe
then you group your data by integer division of row_number() by n, so that each group contains (up to) n consecutive rows.
then you pass each group to group_map(), which returns a list.
So in the end, split is a list in which each element is one chunk of your dataset.
Alternatively, you could write each chunk straight to disk by replacing the group_map() call with e.g. group_walk(~ write_csv(.x, paste0("file_", .y, ".csv"))).
You can find more info on these tools in the dplyr cheat sheet (which covers group_by) and in the documentation for the group_map / group_walk family of functions.
