I have a data frame with millions of rows and tens of columns, to which I need to apply a rowwise operation. My solution below works using dplyr, but I hope a switch to data.table will speed things up. Any help converting the code below to a data.table version would be appreciated.
library(tidyverse)
library(trend)
df = structure(list(id = 1:2, var = c(3L, 9L), col1_x = c("[(1,2,3)]",
"[(100,90,80,70,60,50,40,30,20)]"), col2_x = c("[(2,4,6)]", "[(100,50,25,12,6,3,1,1,1)]"
)), class = "data.frame", row.names = c(NA, -2L))
df = df %>%
  mutate(across(ends_with("x"), ~ gsub("[][()]", "", .)))
x_cols = df %>%
select(ends_with("x")) %>%
names()
df = df %>%
  rowwise() %>%
  mutate(across(all_of(x_cols),
                ~ ifelse(var <= 4, 0,
                         sens.slope(as.numeric(unlist(strsplit(., ','))))$estimates[[1]]))) %>%
  ungroup()
While what @Ritchie Sacramento wrote is absolutely true, here's the information you asked for.
First, I want to start with set and :=. When you see the keyword set (which can just be part of the function name, as in setDT) or the := operator, you've told data.table not to make copies of the data. Without an assignment (that pesky = or <-), you've changed the data.table in place. This is one of the key ways this package avoids wasting memory.
Keep in mind that the environment pane in RStudio updates when it registers an assignment operator (= or <-) creating something new. Since you replaced in place, the environment pane may show stale information. You can use the refresh icon (top right of the pane), or print the object to the console to check. As soon as you make an assignment the pane does recognize, everything in it updates again.
Change a data frame to a data.table. (Notice that keyword: set!) Both of the calls below do the same thing; however, the second copies everything in memory and builds the object again. (Giving the result the same name does not prevent the copy from being made.)
setDT(df)
df <- data.table(df)
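If you want to see the difference for yourself, compare memory addresses before and after (a quick sketch using data.table::address()):
library(data.table)
df1 <- data.frame(x = 1:3)
address(df1)           # address of the original object
setDT(df1)             # the set keyword: converted by reference
address(df1)           # same address, so no copy was made
df2 <- data.table(df1)
address(df2)           # different address: this one is a full copy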
I'm not going to start with your first code chunk; I'm starting with the name extraction.
You wrote:
x_cols = df %>%
select(ends_with("x")) %>%
names()
# [1] "col1_x" "col2_x"
There are many ways to get this information. This is what I did. Note that this doesn't really have anything to do with data.table. I just used base R here. You could use a data frame the same way.
xcols <- names(df)[endsWith(names(df), 'x')]
# [1] "col1_x" "col2_x"
I'm going to use this object, xcols, in the remaining examples. (Why keep reiterating the same declaration?)
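If you prefer regular expressions, an equivalent base R spelling is:
xcols <- grep("x$", names(df), value = TRUE)
# [1] "col1_x" "col2_x"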
You wrote the following to remove the brackets and parentheses.
df = df %>%
  mutate(across(ends_with("x"), ~ gsub("[][()]", "", .)))
# id var col1_x col2_x
# 1 1 3 1,2,3 2,4,6
# 2 2 9 100,90,80,70,60,50,40,30,20 100,50,25,12,6,3,1,1,1
There are several ways you could do this, whether in a data frame or a data.table. Here are a couple of methods you can use with data.table. These do the exact same thing as each other and your code.
Note the :=, which means the table changed.
In the first example, I used .SD and .SDcols. These are data.table's column-selection tools: you use .SD in place of a column name when you want to operate on more than one column, and .SDcols tells data.table which columns .SD should contain. Wrapping the variable in parentheses on the left of :=, as in (xcols), tells data.table to replace the data in those columns rather than create a single column literally named xcols.
The difference between these two is how I used lapply, which doesn't have anything to do with data.table. If you need more info on that function, you can ask, or look through the many Q&As out there already; there's also a small standalone sketch after the next two code blocks.
df[,
(xcols) := lapply(.SD, function(k) gsub("[][()]", "", k)),
.SDcols = xcols]
df[,
(xcols) := lapply(.SD, gsub, pattern = "[][()]",
replacement = ""),
.SDcols = xcols]
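And here is lapply on its own, outside any data.table context, doing the same per-element cleanup:
# apply one function to every element of a list; a list comes back
lapply(list(a = "[(1,2)]", b = "[(3,4)]"), gsub, pattern = "[][()]", replacement = "")
# $a
# [1] "1,2"
#
# $b
# [1] "3,4"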
Your last request was based on this code.
df %>%
  rowwise() %>%
  mutate(across(all_of(x_cols),
                ~ ifelse(var <= 4, 0,
                         sens.slope(as.numeric(unlist(strsplit(., ','))))$estimates[[1]]))) %>%
  ungroup()
Since you used var to delineate when to apply this, I've used the by argument (as in dplyr's group_by). In terms of the other requirements, you'll see .SD and lapply again.
df[,
   (xcols) := lapply(.SD,
                     function(k) {
                       ifelse(var <= 4, 0,
                              sens.slope(as.numeric(strsplit(k, ",")[[1]])
                              )$estimates[[1]])
                     }), by = var, .SDcols = xcols]
If you think about how these differ, you may find that, in a lot of ways, they aren't all that different. For example, this last translation has a close dplyr analogue using the group_by approach:
df %>% group_by(var) %>%
  mutate(across(all_of(x_cols),
                ~ ifelse(var <= 4, 0,
                         sens.slope(as.numeric(unlist(strsplit(., ','))))$estimates[[1]])))
Related
I have a dplyr version of group_by in which I try to cut a column called ratio into different ranges. It works fine, but I am unable to update dplyr to a newer version because RStudio is managed by a common admin. Is there a way to rewrite the same logic the data.table way?
output <- output %>%
  group_by(start = as.IDate(timestamp), VAV = van_d,
           conditions = cut(ratio, breaks = c(0, 0.7, 0.8, 0.9, 1, 100),
                            labels = c("0-0.7", "0.7-0.8", "0.8-0.9", "0.9-1", ">1"))) %>%
  summarise(duration = n()) %>%
  ungroup %>%
  na.omit
With data.table, the general usage is
dt[i, j, by]
where i subsets the rows (it can take a numeric index or a logical expression), j computes on the columns, and by handles grouping. This usage applies only to data.table objects, so if the input dataset is a data.frame or tibble, convert it to a data.table with either as.data.table (which doesn't change the original object) or setDT (which converts to data.table by reference).
Now we specify the i, j, and by. In the dplyr code there is no filtering, i.e. we don't need to specify i, so it stays blank. The group_by code goes into by; it can be wrapped in list() or the shorthand .(). Then the j for summarise is also a list (.(duration = .N)), where .N is the data.table counterpart of dplyr's n().
library(data.table)
setDT(output)[, .(duration = .N),
              by = .(start = as.IDate(timestamp), VAV = van_d,
                     conditions = cut(ratio, breaks = c(0, 0.7, 0.8, 0.9, 1, 100),
                                      labels = c("0-0.7", "0.7-0.8", "0.8-0.9", "0.9-1", ">1")))]
I'm working with datasets (from smartphone experience sampling) where I very frequently have to perform grouped operations (such as finding the variability of a measure within each person, or within each day within each person, etc.). Typical code might look like the code below, which calculates within-day variability for some variables, then takes the mean of the within-day variability and joins it to the original data.
output <- group_by(mydata, id, day) %>%
  mutate_at(vars(angr, sad, guil, anx, hap), funs(sd(., na.rm = TRUE))) %>%
  ungroup() %>%
  group_by(id) %>%
  summarize_at(vars(angr, sad, guil, anx, hap), funs('var_day_mean' = mean(., na.rm = TRUE))) %>%
  left_join(mydata, .)
What I want to do is be able to save this as a function so that instead of having to type out angr, sad, guil, anx, hap many times over, I can call this code (and slight variations on it saved as different functions) on a vector of variable names in a string. So the desired functionality is:
vars <- c('angr', 'sad', 'guil', 'anx', 'hap')
output <- myfunc(vars)
Where myfunc performs the piped operations above.
I'm aware that there is a vignette on non-standard evaluation in dplyr, but it's very limited and doesn't cover mutate or most of what I need to do for this use case, so I would appreciate any insight.
Reproducible example - what I desire is essentially that the below code work, but currently the dplyr pipe cannot take vars as a character vector the way I have input it.
Edit: I was mistaken - the below code does work, and dplyr can function in this way (and can also take character vectors to group_by, making this easy to program with). I leave the code below as a (working) reference.
data <- data.frame('ID' = rep(1:10, each = 10),
'day' = rep(c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2), 10),
'anx' = rnorm(100), 'sad' = rnorm(100), 'hap' = rnorm(100))
vars = c('anx', 'sad', 'hap')
out <- group_by(data, ID, day) %>%
mutate_at(vars, funs(sd(., na.rm = TRUE)))
With mutate_at you can simply supply the names of the columns as a vector:
mtcars %>% mutate_at(c("mpg", "hp"), funs(mean))
This should do the trick.
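One caveat: funs() has since been deprecated. On dplyr 1.0 or later, the equivalent spelling uses across():
library(dplyr)
# current-dplyr equivalent of the funs() call above
mtcars %>% mutate(across(c("mpg", "hp"), mean))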
Is there a better way to add rows within group_by() groups than using bind_rows()? Here's an example that's a little clunky:
df <- data.frame(a=c(1,1,1,2,2), b=1:5)
df %>%
group_by(a) %>%
do(bind_rows(data.frame(a=.$a[1], b=0), ., data.frame(a=.$a[1], b=10)))
The idea is that columns that we're already grouping on could be inferred from the groups.
I was wondering whether something like this could work instead:
df %>%
group_by(a) %>%
insert(b=0, .at=0) %>%
insert(b=10)
Like append(), it could default to inserting after all existing elements, and it could be smart enough to use group values for any columns unspecified. Maybe use NA for non-grouping columns unspecified.
Is there an existing convenient syntax I've missed, or would this be helpful?
Here's an approach using data.table:
library(data.table)
setDT(df)
rbind(df, expand.grid(b = c(0, 10), a = df[ , unique(a)]))[order(a, b)]
Depending on your actual context this much simpler alternative would work too:
df[ , .(b = c(0, b, 10)), by = a]
(and we can simply use c(0, b, 10) in j if we don't care about keeping the name b)
The former has the advantage that it will work even if df has more columns; you just have to set fill = TRUE in data.table's rbind.
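For the example df above, the one-liner returns:
df[ , .(b = c(0, b, 10)), by = a]
#    a  b
# 1: 1  0
# 2: 1  1
# 3: 1  2
# 4: 1  3
# 5: 1 10
# 6: 2  0
# 7: 2  4
# 8: 2  5
# 9: 2 10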
I am trying to order a dataframe by making use of dplyr::arrange. The issue is that the column I am trying to sort on contains both a fixed string followed by a number, as for instance generated by the dummycode below.
dummydf <- data.frame(values = rnorm(100),
                      sortcol = paste0("ABC", sample(1:100, 100, replace = FALSE)))
By default, using dummydf %>% arrange(sortcol) generates a df that is sorted lexicographically, which is of course not the desired result:
values sortcol
0.708081720 ABC1
0.041348322 ABC10
1.730962886 ABC100
0.423480861 ABC11
-1.545837266 ABC12
-1.345539947 ABC13
-0.078998792 ABC14
0.088712174 ABC15
0.670583024 ABC16
1.238837680 ABC17
-1.459044293 ABC18
-2.028535223 ABC19
0.779514385 ABC2
1.360509910 ABC20
In this example, I would like to sort the column as gtools::mixedsort would, making sure ABC2 directly follows ABC1 rather than being preceded by ABC10 through ABC19 and ABC100. mixedsort(as.character(dummydf$sortcol)) would do that trick.
Now, I am aware I could do this by using sub in my arrange argument: dummydf %>% arrange(as.numeric(sub("ABC", "", sortcol))), but that works mainly because my string prefix is fixed (although I suppose any regex could capture the trailing digits after an arbitrary prefix).
I am just wondering: is there a more "elegant" and generic way to get this done with dplyr::arrange, in the same fashion as gtools::mixedsort?
Kind regards,
FM
Here's a functional solution making use of the mysterious identity order(order(x)) == rank(x), which holds whenever x has no ties.
mixedrank = function(x) order(gtools::mixedorder(x))
dummydf %>% dplyr::arrange(mixedrank(sortcol))
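To see that identity at work:
x <- c(30, 10, 20)
order(x)          # 2 3 1: the permutation that sorts x
order(order(x))   # 3 1 2: where each element lands once sorted, i.e. rank(x)
rank(x)           # 3 1 2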
I don't see this answer posted so I'll throw it out. You can use mixedorder with slice to arrange it.
dummydf %>%
slice(mixedorder(sortcol))
Using data.table
library(data.table)
dummydf = data.table(dummydf)
dummydf[gtools::mixedorder(as.character(sortcol))]
Honestly, I just copied your example and stuck it in as the i (row-selection) argument in the data.table syntax. You already did all the hard work :).
Credit to Akhil Nair for his data.table answer which is what the first code snippet derives from. If you like the data.table answer but still want magrittr piping, you can consider calculating a new column and using piping with data.table to get your output:
dummydf %>%
dplyr::mutate(row_lookup = gtools::mixedorder(as.character(sortcol))) %>%
data.table::data.table() %>%
.[.$row_lookup]
I think it's debatable whether that helps or detracts from the readability.
If you don't want to call data.table, you can go through some extra contortions to calculate a column you can use dplyr::arrange on. Here's one example:
library(dplyr)
bind_cols(dummydf,
dummydf %>%
tibble::rowid_to_column("order") %>%
mutate(rowname = gtools::mixedorder(as.character(sortcol))) %>%
arrange(rowname) %>%
select(order)) %>%
arrange(order)
I think this code is more confusing to read and isn't worth those extra contortions to avoid data.table.
Here is a solution that allows sorting when there are repeats and multiple sort columns. Most of the previous answers are not generic: they handle only a single sort key.
df <- data.frame(values = rnorm(100),
                 sortcol1 = paste0("ASORT", sample(1:100, 100, replace = TRUE)),
                 sortcol2 = paste0("BSORT", sample(1:100, 100, replace = TRUE)),
                 stringsAsFactors = FALSE)
df %>%
  mutate(
    sortcol1 = factor(sortcol1, ordered = TRUE, levels = unique(gtools::mixedsort(sortcol1))),
    sortcol2 = factor(sortcol2, ordered = TRUE, levels = unique(gtools::mixedsort(sortcol2)))
  ) %>%
  arrange(sortcol1, sortcol2)
Consider the following dataframe:
df <- data.frame(replicate(5,sample(1:10, 10, rep=TRUE)))
If I want to divide each row by its sum (to make a probability distribution), I need to do something like this:
df %>% mutate(rs = rowSums(.)) %>% mutate_each(funs(. / rs), -rs) %>% select(-rs)
This really feels inefficient:
Create an rs column
Divide each value by its row's sum from rowSums()
Remove the temporary column to clean up the original dataframe.
When working with existing columns, it feels much more natural:
df %>% summarise_each(funs(weighted.mean(., X1)), -X1)
Using dplyr, would there a better way to work with temporary columns (created on-the-fly) than having to add and remove them after processing ?
I'm also interested in how data.table would handle such a task.
As I mentioned in a comment above, I don't think it makes sense to keep that data in either a data.frame or a data.table, but if you must, the following will do it without converting to a matrix, and it illustrates how to create a temporary variable in the data.table j-expression:
dt = as.data.table(df)
dt[, names(dt) := {sums = Reduce(`+`, .SD)   # temporary row sums, never stored in dt
                   lapply(.SD, '/', sums)}]  # divide every column by those sums
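The braces in j can stage any number of temporaries; only the last expression is returned. As a follow-up check (the column names rowsum and rowmax are mine), since dt was just normalized in place, every rowsum should come back as 1:
dt[, {
  sums = Reduce(`+`, .SD)                        # temporary again, gone after j finishes
  .(rowsum = sums, rowmax = do.call(pmax, .SD))  # last expression: a two-column result
}]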
Why not consider base R as well:
as.data.frame(as.matrix(df)/rowSums(df))
Or just with your data.frame:
df/rowSums(df)
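Either way, a quick sanity check that every row now sums to one:
out <- df / rowSums(df)
rowSums(out)   # numerically 1 for every row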