I am wondering if there is an efficient or alternative way to compute the row-wise product of a selection of columns in dplyr.
I know one way to do it (see below), but rowwise() takes a long time to run on my large data set, so I am looking for an alternative.
df = df %>%
rowwise %>%
mutate(myprod = prod(c_across(starts_with('var_xyz'))))
Here are some alternative options.
If you want to stay in the tidyverse, you can try pmap_dbl:
library(dplyr)
library(purrr)
df %>% mutate(myprod = pmap_dbl(select(., starts_with('var_xyz')), prod))
A base R option with Reduce or using rowProds from matrixStats.
cols <- grep('^var_xyz', names(df))
#2.
df$myprod <- Reduce(`*`, df[cols])
#3.
df$myprod <- matrixStats::rowProds(as.matrix(df[cols]))
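As a quick sanity check, the sketch below uses only base R to confirm that the Reduce approach matches a row-by-row prod(). The data frame `toy` and its column names are made up for illustration and stand in for the real var_xyz columns.

```r
# Toy data; the column names are hypothetical stand-ins for the real var_xyz* columns.
toy <- data.frame(
  id       = 1:4,
  var_xyz1 = c(1, 2, 3, 4),
  var_xyz2 = c(2, 2, 2, 2),
  var_xyz3 = c(0.5, 1, 1.5, 2)
)

cols <- grep("^var_xyz", names(toy))

# Vectorised: multiplies whole columns together, one `*` call per extra column.
prod_reduce <- Reduce(`*`, toy[cols])

# Row by row: one prod() call per row, similar in spirit to rowwise().
prod_rowwise <- unname(apply(as.matrix(toy[cols]), 1, prod))

all.equal(prod_reduce, prod_rowwise)
```

With k selected columns, Reduce performs k - 1 vectorised multiplications instead of one prod() call per row, which is likely why it tends to scale better on long data.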
I have a data frame df containing a time column and values in a set of columns called stage_1, stage_2, ..., stage_50. I would like to divide all the values in the columns stage_1 to stage_50 by the corresponding value in the time column.
df<-data.frame(time=runif(10)*60,stage_1=runif(10)*10,stage_2=runif(10)*10, someOtherColumn=rep("A",10))
I can select the columns called stage and put them in another df.
df1<-df %>%
select(starts_with("stage"))
then divide:
df1/df$time
but that doesn't seem very satisfactory. How can I use starts_with inside mutate?
e.g.
df%>%
mutate(starts_with("stage")/time)
1) across Use across:
library(dplyr)
df %>% mutate(across(starts_with("stage"), ~ . / time))
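It could alternatively be written like this (passing extra arguments through across() is deprecated as of dplyr 1.1.0, so the formula form above is preferable on recent versions):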
df %>% mutate(across(starts_with("stage"), `/`, time))
2) pivot Another way to do this is to reshape into long form, perform the division and then reshape back.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(starts_with("stage")) %>%
mutate(value = value / time) %>%
pivot_wider
3) base R It can also be done readily in base R:
ok <- startsWith(names(df), "stage")
replace(df, ok, df[ok] / df$time)
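The base R version relies on how a data frame is divided by a vector: when the vector has one element per row, each column is divided element-wise by it. A small check on made-up data (the data frame `toy` is for illustration only):

```r
toy <- data.frame(
  time    = c(10, 20, 40),
  stage_1 = c(1, 2, 4),
  stage_2 = c(5, 10, 20),
  label   = c("A", "B", "C")
)

ok <- startsWith(names(toy), "stage")

# Each selected column is divided row by row by toy$time;
# the other columns are carried over unchanged.
res <- replace(toy, ok, toy[ok] / toy$time)

res$stage_1   # 0.1 0.1 0.1
res$stage_2   # 0.5 0.5 0.5
```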
4) ftransformv The collapse package has ftransformv to apply the indicated function to the selected columns. It is written in C/C++ and runs 13x faster than the base solution, 112x faster than the dplyr solution and 363x faster than the tidyr solution when I benchmarked it.
library(collapse)
ftransformv(df, startsWith(names(df), "stage"), `/`, time)
I want to demean all my columns using dplyr. I tried but failed using the "do()" command.
I basically want to replicate the following using easier dplyr commands:
tickers <- c(rep(1,10),rep(2,10))
df <- data.frame(cbind(tickers,rep(1:20),rep(2:21)))
colnames(df) <- c("tickers","col1","col2")
df %>% group_by(tickers)
apply(df[,2:3],2,function(x) x - mean(x))
I am sure this can be done much better using dplyr.
Thanks!
If we are using dplyr, we can do this with mutate_each and use any of the helpers mentioned in ?select to match the columns. Here, I am using matches, which takes a regular expression as its pattern.
library(dplyr)
df %>%
mutate_each(funs(.-mean(.)), matches('^col')) %>%
select(-tickers)
But this can be done also using base R:
df[2:3]-colMeans(df[2:3])[col(df[2:3])]
The colMeans output is a vector which can be replicated so that the lengths will be the same.
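mutate_each() and funs() have since been deprecated in dplyr. Below is a minimal sketch of the same idea with across() (available from dplyr 1.0.0), with the grouping from the question added so that each ticker is demeaned separately. Treat it as one possible modern equivalent rather than the answer's original code.

```r
library(dplyr)

tickers <- c(rep(1, 10), rep(2, 10))
df <- data.frame(tickers = tickers, col1 = 1:20, col2 = 2:21)

demeaned <- df %>%
  group_by(tickers) %>%
  # Subtract the per-ticker mean from every column whose name starts with "col".
  mutate(across(starts_with("col"), ~ .x - mean(.x))) %>%
  ungroup()
```

Dropping the group_by() line demeans each column over the whole table instead, which is what the colMeans version does.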
I have a dplyr question: How do I use transmute over each column without writing each column out by hand? I.e. is there something like transmute_each()?
I want to do the following: Using dplyr I want to get the z-score of each column for a MWE below:
tickers <- c(rep(1,10),rep(2,10))
df <- data.frame(cbind(tickers,rep(1:20),rep(2:21),rep(2:21),rep(4:23),rep(3:22)))
colnames(df) <- c("tickers","col1","col2","col3","col4","col5")
df %>% group_by(tickers)
Is there a simple way to then use transmute to achieve the following:
for(i in 2:ncol(df)){
df[,i] <- (df[,i] - mean(df[,i]))/sd(df[,i])
}
Many thanks
Now that there is a transmute_at() function (as of dplyr 0.7), you can do the following:
df %>%
group_by(tickers) %>%
transmute_at(.vars = vars(starts_with("col")),
.funs = funs(scale(.))) %>%
ungroup
Note that this uses the scale() function from base R, which by default converts a numeric vector into a z-score.
Also, the use of vars() in the .vars argument allows you to use all the helper functions that are available for dplyr's select(), such as one_of(), ends_with(), etc.
Finally, instead of writing funs(scale(.)) here, since you're using a simple function in the .funs argument, you can just write .funs = scale.
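One detail worth checking: scale() returns a one-column matrix with attributes rather than a plain vector, so wrapping it in as.numeric() is a common precaution when a plain column is wanted. A small base R sketch (the vector `x` is made up) confirming that it matches the manual z-score:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

z_scale  <- scale(x)                 # 8 x 1 matrix with center/scale attributes
z_manual <- (x - mean(x)) / sd(x)    # plain numeric vector

is.matrix(z_scale)                        # TRUE
all.equal(as.numeric(z_scale), z_manual)  # TRUE
```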
I solved this using the following:
df %>%
group_by(tickers) %>%
mutate_at(.funs = funs((. - mean(.))/sd(.)),
.vars = vars(matches("col")))
Is it possible to set all column names to upper or lower within a dplyr or magrittr chain?
In the example below I load the data and then, using a magrittr pipe, chain it through to my dplyr mutations. In the 4th line I use the tolower function, but this is for a different purpose: to create a new variable with lowercase observations.
mydata <- read.csv('myfile.csv') %>%
mutate(Year = mdy_hms(DATE),
Reference = (REFNUM),
Event = tolower(EVENT))
I'm obviously looking for something like colnames = tolower but know this doesn't work/exist.
I note the dplyr rename function but this isn't really helpful.
In magrittr the colname options are:
set_colnames instead of base R's colnames<-
set_names instead of base R's names<-
I've tried numerous permutations with these but no dice.
Obviously this is very simple in base r.
names(mydata) <- tolower(names(mydata))
However it seems incongruous with the dplyr/magrittr philosophies that you'd have to do that as a clunky one liner, before moving on to an elegant chain of dplyr/magrittr code.
With {dplyr} we can do:
mydata %>% rename_all(tolower)
or, from dplyr 1.0.0 onwards:
mydata %>% rename_with(tolower)
iris %>% setNames(tolower(names(.))) %>% head
Or equivalently use replacement function in non-replacement form:
iris %>% `names<-`(tolower(names(.))) %>% head
iris %>% `colnames<-`(tolower(names(.))) %>% head # if you really want to use `colnames<-`
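The backtick form works because a replacement function such as `names<-` is an ordinary function that takes the object and the new value and returns the modified object. A short sketch without any pipe, for illustration:

```r
# Calling the replacement function directly returns the renamed data frame.
lower_iris <- `names<-`(iris, tolower(names(iris)))

names(lower_iris)
```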
Using magrittr's "compound assignment pipe-operator" %<>% might be, if I understand your question correctly, an even more succinct option.
library("magrittr")
names(iris) %<>% tolower
?`%<>%` # for more
mtcars %>%
set_colnames(value = casefold(colnames(.), upper = FALSE)) %>%
head
casefold is available in base R and can convert in both directions, i.e. to either all upper case or all lower case, depending on the upper flag.
Also, colnames() restricts the case conversion to the column headers.
You could also define a function:
upcase <- function(df) {
names(df) <- toupper(names(df))
df
}
library(dplyr)
mtcars %>% upcase %>% select(MPG)
I am trying to move from plyr to dplyr. However, I still can't figure out how to call my own functions in a chained dplyr call.
I have a data frame with a factorised ID variable and an order variable. I want to split the frame by the ID, order it by the order variable and add a sequence in a new column.
My plyr function looks like this:
f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x)-1))
data <- ddply(data, .(ID_variable), f)
In dplyr I thought this should look something like this:
f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x)-1))
data <- data %>% group_by(ID_variable) %>% f
Can anyone tell me how to modify my dplyr call to successfully pass my own function and get the same functionality my plyr function provides?
EDIT: If I use the dplyr formula as described here, it DOES pass an object to f. However, while plyr seems to pass a number of different tables (split by the ID variable), dplyr does not pass one table per group but the ENTIRE table (as some kind of dplyr object where groups are annotated), thus when I cbind the Experience variable it appends a counter from 0 to the length of the entire table instead of the single groups.
I have found a way to get the same functionality in dplyr using this approach:
data <- data %>%
group_by(ID_variable) %>%
arrange(ID_variable,order_variable) %>%
mutate(Experience = 0:(n()-1))
However, I would still be keen to learn how to pass grouped data, split into one table per group, to my own functions in dplyr.
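One way to see what "one table per group" means is base R's split-apply-combine idiom, which hands each group's rows to `f` separately, much as ddply does. This is a sketch on made-up data rather than the dplyr mechanism itself; the data frame `toy` is hypothetical.

```r
f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x) - 1))

toy <- data.frame(
  ID_variable    = factor(c("a", "b", "a", "b", "a")),
  order_variable = c(3, 1, 2, 2, 1)
)

# split() produces one data frame per ID; lapply() calls f on each of them.
pieces <- lapply(split(toy, toy$ID_variable), f)

# Stack the per-group results back into a single data frame.
result <- do.call(rbind, pieces)
result$Experience   # 0 1 2 0 1
```

Because `f` only ever sees one group's rows, the counter restarts at 0 for each ID.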
For those who get here from Google: let's say you wrote your own print function.
printFunction <- function(dat) print(dat)
df <- data.frame(a = 1:6, b = 1:2)
As it was asked here
df %>%
group_by(b) %>%
printFunction(.)
prints the entire data frame. To have dplyr print one table per group, use do:
df %>%
group_by(b) %>%
do(printFunction(.))
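do() has since been superseded in dplyr. Below is a hedged sketch of the same idea with group_modify() (available from dplyr 0.8.1), which calls a function once per group with that group's rows. The helper `add_counter` is a made-up example function, not part of dplyr.

```r
library(dplyr)

df <- data.frame(a = 1:6, b = 1:2)

# Hypothetical user function: receives one group's rows as a data frame.
add_counter <- function(dat) {
  dat$counter <- seq_len(nrow(dat)) - 1
  dat
}

res <- df %>%
  group_by(b) %>%
  # .x holds the rows of the current group (without the grouping column).
  group_modify(~ add_counter(.x)) %>%
  ungroup()
```

For functions called only for their side effects, such as printing, group_walk() follows the same per-group calling convention.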