I have a dataframe with millions of rows and tens of columns to which I need to apply a rowwise operation. My solution below works using dplyr, but I hope a switch to data.table will speed things up. Any help converting the code below to a data.table version would be appreciated.
library(tidyverse)
library(trend)
df = structure(list(id = 1:2, var = c(3L, 9L), col1_x = c("[(1,2,3)]",
"[(100,90,80,70,60,50,40,30,20)]"), col2_x = c("[(2,4,6)]", "[(100,50,25,12,6,3,1,1,1)]"
)), class = "data.frame", row.names = c(NA, -2L))
df = df %>%
mutate(across(ends_with("x"),~ gsub("[][()]", "", .)))
x_cols = df %>%
select(ends_with("x")) %>%
names()
df = df %>%
rowwise() %>%
  mutate(across(all_of(x_cols), ~ ifelse(var <= 4, 0, sens.slope(as.numeric(unlist(strsplit(., ','))))$estimates[[1]]))) %>%
ungroup()
While what @Ritchie Sacramento wrote is absolutely true, here's the information you asked for.
First, I want to start with set and :=. When you see the keyword set (which can just be part of a function name, as in setDT) or the := operator, you've told data.table not to make copies of the data. Because there is no assignment (that pesky = or <-), you've changed the data.table in place. This is one of the key ways this package avoids wasted memory.
Keep in mind that the environment pane in RStudio updates when it registers an assignment operator (= or <-) creating something new. Since a replace-in-place involves no assignment, the pane may show stale information. You can use the refresh icon (top right of the pane), or you can print the object to the console to check. As soon as you declare anything that the pane recognizes, everything in the pane is updated.
Change a data frame to a data.table. (Notice that keyword: set!) Both of the following do the same thing. However, the second copies everything in memory and makes it again; naming the new object the same as the old one does not prevent the copy.
setDT(df)
df <- data.table(df)
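A quick way to see the difference is data.table's address() helper, which reports where an object lives in memory (a minimal sketch; the addresses themselves will differ on your machine):
library(data.table)
d <- data.frame(x = 1:3)
address(d)            # note the address
setDT(d)              # convert in place
address(d)            # same address: no copy was made
d2 <- data.frame(x = 1:3)
d3 <- data.table(d2)  # builds a new object
address(d3)           # differs from address(d2): a copy now exists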
I'm not going to start with your first code blurb. I'm starting with the name extraction.
You wrote:
x_cols = df %>%
select(ends_with("x")) %>%
names()
# [1] "col1_x" "col2_x"
There are many ways to get this information. This is what I did. Note that this doesn't really have anything to do with data.table. I just used base R here. You could use a data frame the same way.
xcols <- names(df)[endsWith(names(df), 'x')]
# [1] "col1_x" "col2_x"
I'm going to use this object, xcols, in the remaining examples. (Why keep reiterating the same declaration?)
You wrote the following to remove the brackets and parentheses.
df = df %>%
mutate(across(ends_with("x"),~ gsub("[][()]", "", .)))
# id var col1_x col2_x
# 1 1 3 1,2,3 2,4,6
# 2 2 9 100,90,80,70,60,50,40,30,20 100,50,25,12,6,3,1,1,1
There are several ways you could do this, whether in a data frame or a data.table. Here are a couple of methods you can use with data.table. These do the exact same thing as each other and your code.
Note the :=, which means the table changed.
In the first example, I used .SD and .SDcols. These are data.table's column-selection tools: you use .SD in place of a column name when you want to work on more than one column, and .SDcols tells data.table which columns .SD should contain. Wrapping the assignment target in parentheses, (xcols), where xcols is the variable holding my column names, tells data.table to replace the data in those columns rather than create a single column literally named xcols.
The only difference between the two versions is how I used lapply, which has nothing to do with data.table (there's a toy illustration after the code). If you need more info on this function, you can ask me, or you can look through the many Q&As out there already.
df[,
(xcols) := lapply(.SD, function(k) gsub("[][()]", "", k)),
.SDcols = xcols]
df[,
(xcols) := lapply(.SD, gsub, pattern = "[][()]",
replacement = ""),
.SDcols = xcols]
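As a toy illustration of that lapply difference (nothing data.table-specific here), both calls below strip parentheses the same way; one uses an anonymous function, the other passes gsub's arguments through lapply's ...:
lapply(c("a(1)", "b(2)"), function(k) gsub("[()]", "", k))
lapply(c("a(1)", "b(2)"), gsub, pattern = "[()]", replacement = "")
# both return list("a1", "b2")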
Your last request was based on this code.
df %>%
rowwise() %>%
mutate(across(all_of(x_cols),
                ~ ifelse(var <= 4, 0, sens.slope(
                    as.numeric(unlist(
                      strsplit(., ','))))$estimates[[1]]))) %>%
ungroup()
Since you used var to delineate when to apply this, I've used the by argument (as in dplyr's group_by). In terms of the other requirements, you'll see .SD and lapply again.
df[,
(xcols) := lapply(.SD,
function(k) {
       ifelse(var <= 4, 0,
sens.slope(as.numeric(strsplit(k, ",")[[1]])
)$estimates[[1]])
}), by = var, .SDcols = xcols]
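One caveat worth noting: with by = var, each column handed to lapply via .SD is a vector holding all of that group's rows, and strsplit(k, ",")[[1]] only uses the first element. That's fine here because each var value occurs on a single row; if var values could repeat, a per-element version would be safer (a sketch, assuming the same df, xcols, and threshold as above):
df[,
   (xcols) := lapply(.SD, function(k) {
     if (var <= 4) return(rep(0, length(k)))
     # compute one slope per element of the group
     vapply(strsplit(k, ","),
            function(v) sens.slope(as.numeric(v))$estimates[[1]],
            numeric(1))
   }), by = var, .SDcols = xcols]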
If you think about how these differ, you may find that, in a lot of ways, they aren't all that different. For example, in this last translation, you may see a similar approach in dplyr that I used.
df %>% group_by(var) %>%
mutate(across(all_of(x_cols),
                ~ ifelse(var <= 4, 0, sens.slope(
as.numeric(unlist(
strsplit(., ','))))$estimates[[1]])))
I have a dataframe df containing a column of times and values in a set of columns called stage_1, stage_2, ..., stage_50. I would like to divide all the values in the columns stage_1 to stage_50 by the corresponding value in the time column.
df <- data.frame(time = runif(10) * 60,
                 stage_1 = runif(10) * 10,
                 stage_2 = runif(10) * 10,
                 someOtherColumn = rep("A", 10))
I can select the columns called stage and put them in another df.
df1<-df %>%
select(starts_with("stage")
then divide:
df1/df$time
but that doesn't seem very satisfactory. How can I use starts_with inside mutate?
e.g.
df%>%
mutate(starts_with("stage")/time)
1) across Use across:
library(dplyr)
df %>% mutate(across(starts_with("stage"), ~ . / time))
It could alternatively be written like this:
df %>% mutate(across(starts_with("stage"), `/`, time))
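Note that passing extra arguments through across()'s ... (the second form above) is deprecated as of dplyr 1.1.0, so with a recent dplyr an explicit anonymous function is preferred:
df %>% mutate(across(starts_with("stage"), function(x) x / time))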
2) pivot Another way to do this is to reshape into long form, perform the division and then reshape back.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(starts_with("stage")) %>%
mutate(value = value / time) %>%
pivot_wider
3) base R It can also be done readily in base R:
ok <- startsWith(names(df), "stage")
replace(df, ok, df[ok] / df$time)
4) ftransformv The collapse package has ftransformv to apply the indicated function to the selected columns. It is written in C/C++ and runs 13x faster than the base solution, 112x faster than the dplyr solution and 363x faster than the tidyr solution when I benchmarked it.
library(collapse)
ftransformv(df, startsWith(names(df), "stage"), `/`, time)
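Your exact speedups will vary with machine and data size. For reference, a sketch of how such a comparison might be run with the microbenchmark package (assuming df as defined in the question):
library(microbenchmark)
ok <- startsWith(names(df), "stage")
microbenchmark(
  base     = replace(df, ok, df[ok] / df$time),
  collapse = ftransformv(df, ok, `/`, time),
  times = 100L
)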
I am trying to fill pca.data$Type with 'DMSO' wherever 'DMSO' is present in the pca.data$sample column in R.
pca.data$Type[pca.data$sample %in% "DMSO"]='DMSO'
pca.data$Type[grep("DMSO", pca.data$sample)] = "DMSO"
There are several ways to do that. In addition to the base R methods already proposed, you can use data.table or dplyr.
data.table
Use conditional replacement with the := operator (update by reference):
dt <- data.table::as.data.table(pca.data)
dt[grepl("DMSO", get('sample')), Type := "DMSO"]
The above snippet makes the assignment by reference. If you want to visualize the output, print the object with dt[].
dplyr
You might use dplyr::if_else in this case
pca.data %>% dplyr::mutate(Type = if_else(grepl("DMSO", sample), 'DMSO', sample))
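Note that the FALSE branch above writes the sample value into Type for non-matching rows. If Type already holds values you want to keep, pass it as the third argument instead (a sketch, assuming Type already exists):
pca.data %>% dplyr::mutate(Type = dplyr::if_else(grepl("DMSO", sample), "DMSO", Type))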
Consider the following dataframe:
df <- data.frame(replicate(5,sample(1:10, 10, rep=TRUE)))
If I want to divide each row by its sum (to make a probability distribution), I need to do something like this:
df %>% mutate(rs = rowSums(.)) %>% mutate_each(funs(. / rs), -rs) %>% select(-rs)
This really feels inefficient:
Create an rs column
Divide each of the values by the corresponding rowSums() value
Remove the temporarily created column to clean up the original dataframe.
When working with existing columns, it feels much more natural:
df %>% summarise_each(funs(weighted.mean(., X1)), -X1)
Using dplyr, would there be a better way to work with temporary columns (created on the fly) than having to add and remove them after processing?
I'm also interested in how data.table would handle such a task.
As I mentioned in a comment above, I don't think it makes sense to keep that data in either a data.frame or a data.table, but if you must, the following will do it without converting to a matrix and illustrates how to create a temporary variable in the data.table j-expression:
dt = as.data.table(df)
dt[, names(dt) := {sums = Reduce(`+`, .SD); lapply(.SD, '/', sums)}]
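An equivalent variant (a sketch) uses rowSums() instead of Reduce(), which some may find more readable:
dt[, names(dt) := .SD / rowSums(.SD)]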
Why not consider base R as well:
as.data.frame(as.matrix(df)/rowSums(df))
Or just with your data.frame:
df/rowSums(df)
I am trying to transition from plyr to dplyr. However, I still can't seem to figure out how to call my own functions in a chained dplyr pipeline.
I have a data frame with a factorised ID variable and an order variable. I want to split the frame by the ID, order it by the order variable and add a sequence in a new column.
My plyr function looks like this:
f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x)-1))
data <- ddply(data, .(ID_variable), f)
In dplyr I thought this should look something like this:
f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x)-1))
data <- data %>% group_by(ID_variable) %>% f
Can anyone tell me how to modify my dplyr call to successfully pass my own function and get the same functionality my plyr function provides?
EDIT: If I use the dplyr formula as described here, it DOES pass an object to f. However, while plyr passes a series of separate tables (split by the ID variable), dplyr does not pass one table per group but the ENTIRE table (as a dplyr object in which the groups are merely annotated). Thus, when I cbind the Experience variable, it appends a counter running from 0 to the length of the entire table instead of restarting for each group.
I have found a way to get the same functionality in dplyr using this approach:
data <- data %>%
group_by(ID_variable) %>%
arrange(ID_variable,order_variable) %>%
mutate(Experience = 0:(n()-1))
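Incidentally, the same sequence can be produced with row_number(), which avoids spelling out 0:(n()-1) (a sketch):
data <- data %>%
  group_by(ID_variable) %>%
  arrange(order_variable, .by_group = TRUE) %>%
  mutate(Experience = row_number() - 1) %>%
  ungroup()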
However, I would still be keen to learn how to pass the grouped data, split into separate tables, to my own functions in dplyr.
For those who get here from Google: let's say you wrote your own print function.
printFunction <- function(dat) print(dat)
df <- data.frame(a = 1:6, b = 1:2)
As asked here,
df %>%
group_by(b) %>%
printFunction(.)
prints the entire data frame. To have dplyr print one table per group, you should use do:
df %>%
group_by(b) %>%
do(printFunction(.))
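In current dplyr (1.0+), do() is superseded; group_walk() plays the same per-group role for side effects such as printing (a sketch):
df %>%
  group_by(b) %>%
  group_walk(~ print(.x))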