data.table: feeding list of logical conditions in i - r

I have to write a long script (call it Script) doing various operations on a data.table, then apply this a few times for different subsets of rows. I would like to be able to do the following:
condition <- "X>10"
source(Script)
... where Script will contain many of the following:
dt[MAGIC(condition), .......]
This would allow me to keep Script and the condition in different files (the second a markdown showing the results only, and which I'd like to be as simple as possible in terms of code).
What I don't want is to copy paste the script for each of the conditions, and manually change it, since this is way too error prone.
I tried lots of combinations of parse, deparse, substitute, quote, as.expression, as.logical, etc. but I seem to be stumbling in the dark. I'd be really grateful if somebody could help!
NB: I can easily do the above in dplyr:
df %>% filter_(condition)
and of course I can also turn this back into a data.table
df %>% filter_(condition) %>% data.table()
... but I'd rather work consistently with data.table (faster, prefer the syntax, etc.)

We can use eval(parse(text = condition)) in i:
setdT(dt)[eval(parse(text=condition))]
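A minimal sketch of how this could look inside the sourced script, assuming the data.table package is installed. The table dt, its columns, and the name cond_expr are made up for illustration. Parsing the string once and reusing the expression avoids repeating parse() in every call. Note that eval(parse(...)) runs whatever code the string contains, so it is only appropriate for condition strings you control.

```r
library(data.table)

# Hypothetical example data; column X matches the condition used in the question.
dt <- data.table(X = c(5, 12, 20, 8), Y = c("a", "b", "c", "d"))

condition <- "X > 10"

# Parse the string once into an unevaluated call.
cond_expr <- parse(text = condition)[[1]]

# Evaluate it in i, where data.table resolves X as a column of dt.
res <- dt[eval(cond_expr)]

# The same expression can be reused with any j.
mean_x <- dt[eval(cond_expr), mean(X)]
```

With this layout the markdown file only needs to set condition and source the script.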

Related

How to change a dataframe's column types using tidy selection principles

I'm wondering what are the best practices to change a dataframe's column types ideally using tidy selection languages.
Ideally you would set the col types correctly up front when you import the data but that isn't always possible for various reasons.
So the next best pattern that I could identify is the below:
#random dataframe
df <- tibble(a_col=1:10,
b_col=letters[1:10],
c_col=seq.Date(ymd("2022-01-01"),by="day",length.out = 10))
My current favorite pattern involves using across() because I can use tidy selection verb to select variables that I want and then can "map" a formula to those.
# current favorite pattern
df<- df %>%
mutate(across(starts_with("a"),as.character))
Does anyone have any other favorite patterns or useful tricks here? It doesn't have to mutate. Often times I have to change the column types of dataframes with 100s of columns so it becomes quite tedious.
Yes, this happens. The pain point is dates stored as character: once you convert them and then try to convert them again (say, in a later mutate / summarise), you may get an error.
In such cases, change the data type only once you know what kind of data the column holds.
Select columns by name if the names follow a meaningful pattern.
Before applying as.*, check with is.* whether the column is already of that type.
You can apply the conversion with map / lapply / a for loop, whichever is most comfortable.
But it would be difficult to have a single approach for "all dataframes", since people name fields according to their own choice or convenience.
That's my approach; I hope others share theirs.
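The advice above (select by name pattern, check the type first, then convert with lapply) can be sketched in base R as follows. The data frame, the "_date" suffix convention, and the helper to_date are assumptions made for the example. Base R has no is.Date, so inherits() stands in for the is.* check.

```r
# Hypothetical data: the "_date" suffix is an assumed naming convention.
df <- data.frame(
  id         = 1:3,
  start_date = c("2022-01-01", "2022-01-02", "2022-01-03"),
  end_date   = as.Date(c("2022-02-01", "2022-02-02", "2022-02-03")),
  stringsAsFactors = FALSE
)

# Select columns by a pattern in their names.
date_cols <- grep("_date$", names(df), value = TRUE)

# Convert only columns that are not already Date, so re-running is harmless.
to_date <- function(col) if (inherits(col, "Date")) col else as.Date(col)
df[date_cols] <- lapply(df[date_cols], to_date)
```

Because the helper checks the class first, running the conversion a second time leaves the columns unchanged instead of failing.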

Using "count" function in a loop in R

I'm quite new to R and I've been learning with the available resources on the internet.
I came across this issue where I have a vector (a) with vars "1", "2", and "3". I want to use the count function to generate a new df with the categories for each of those variables and its frequencies.
The function I want to use in a loop is this
b <- count(mydata, var1)
However, when I use this loop below;
for (i in (a)) {
'j' <- count(mydata[, i])
print (j)
}
The loop runs, but the only frequencies saved in j are those of the categorical variable "var 3".
Can someone assist me on this code please?
TIA!
In R there are generally better ways than to use loops to process data. In your particular case, the “straightforward” way fails, because the idea of the “tidyverse” is to have the data in tidy format (I highly recommend you read this article; it’s somewhat long but its explanation is really fundamental for any kind of data processing, even beyond the tidyverse). But (from the perspective of your code) your data is spread across multiple columns (wide format) rather than being in a single column (long form).
The other issue is that count (like many other tidyverse functions) expects an unevaluated column name. It does not accept the column name via a variable. akrun’s answer shows how you can work around this (using tidy evaluation and the bang-bang operator) but that’s a workaround that’s not necessary here.
The usual solution, instead of using a loop, would first require you to bring your data into long form, using pivot_longer.
After that, you can perform a single count on your data:
result <- mydata %>%
pivot_longer(all_of(a), names_to = 'Var', values_to = 'Value') %>%
count(Var, Value)
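To make the shape of the result concrete, here is the same pipeline run on a small made-up table, assuming dplyr and tidyr are installed. The contents of mydata and the column names in a are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with three categorical columns.
mydata <- tibble(
  var1 = c("x", "y", "x"),
  var2 = c("p", "p", "q"),
  var3 = c("m", "m", "m")
)
a <- c("var1", "var2", "var3")

# One row per (column, value) pair, then a single count over both.
result <- mydata %>%
  pivot_longer(all_of(a), names_to = "Var", values_to = "Value") %>%
  count(Var, Value)
```

The result holds one row for each distinct value of each column, with its frequency in n.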
Some comments regarding your current approach:
Be wary of cryptic variable names: what are i, j and a? Use concise but descriptive variable names. There are some conventions where i and j are used but, if so, they almost exclusively refer to index variables in a loop over vector indices. Using them differently is therefore quite misleading.
There’s generally no need to put parentheses around a variable name in R (except when that name is the sole argument to a function call). That is, instead of for (i in (a)) it’s conventional to write for (i in a).
Don’t put quotes around your variable names! R happens to accept the code 'j' <- … but since quotes normally signify string literals, its use here is incredibly misleading, and additionally doesn’t serve a purpose.
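If a loop is still preferred, one reason the original attempt appears to keep only one result is that j is overwritten on every iteration. A minimal sketch that stores each result under its column name instead, assuming dplyr is installed; mydata, column_names and counts are hypothetical names, and the .data pronoun is used to refer to a column by string.

```r
library(dplyr)

# Hypothetical data and column names.
mydata <- tibble(
  var1 = c("x", "y", "x"),
  var2 = c("p", "p", "q")
)
column_names <- c("var1", "var2")

# Collect one frequency table per column in a named list.
counts <- list()
for (column in column_names) {
  counts[[column]] <- count(mydata, .data[[column]])
}
```

Each element, for example counts$var1, is then available after the loop has finished.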

R - Access an object's own name in apply functions

This is a problem I often encounter: I try to access an object's own name when using a function from the apply family and spend hours figuring out how to do it... For instance (this is not the core of my question), today I wanted to inspect an attached package to figure out whether it contained any non-function objects. After a lot of trial and error, I finally came up with (for the rrapply package - I know looking at the documentation is also easy but this one illustrates the problem well):
library(rrapply)
eapply(rlang::pkg_env('rrapply'), function(x) {if(!is.function(x)) x}) %>%
`[`(sapply(., function(x) !is.null(x))) %>%
names()
## [1] "renewable_energy_by_country" "pokedex"
I feel that is really too complicated for a simple test !
So my question: is there an easy way to loop through an object in base R (or maybe tidyverse) and return only the names of those elements that correspond to a certain condition ? rrapply seems to be able to achieve that but:
it is fairly complicated
and it seems to work on lists only and to loop through all sub-elements as well which is not desired
Thanks !
Identify the environment of interest, e, and then use eapply with the indicated function taking the names of the extracted elements at the end. This isn't conceptually different from the code in the question but does seem somewhat less complex when done in base R in the following way:
e <- as.environment("package:rrapply")
names(Filter(`!`, eapply(e, is.function)))
or the same code written as a pipeline:
library(magrittr)
"package:rrapply" %>%
as.environment %>%
eapply(is.function) %>%
Filter(`!`, .) %>%
names
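The same pattern works on any environment or named list, not only on a package. A small base R sketch using a hand-built environment so it runs without rrapply; the object names in e are invented for the example. The second form computes a logical vector first and then indexes the names with it, which some may find easier to read.

```r
# Hypothetical environment holding a mix of functions and data.
e <- new.env()
assign("helper", function(x) x + 1, envir = e)
assign("dataset", data.frame(a = 1:2), envir = e)
assign("const", 42, envir = e)

# Form 1: as in the answer, keep the elements for which is.function is FALSE.
non_functions <- names(Filter(`!`, eapply(e, is.function)))

# Form 2: build a named logical vector, then pick the names that fail the test.
is_fun <- vapply(as.list(e), is.function, logical(1))
non_functions_2 <- names(is_fun)[!is_fun]
```

The order in which an environment lists its objects is arbitrary, so compare the results as sets rather than by position.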

Is there a way to apply plyr's count() function to every column individually?

Similar to this question but for R. I want to get a summary count of every variable in each column of a data frame.
Currently, doing something like plyr::count(df[,1:10]) counts how many times each combination of values across a row occurs. Instead, I just want a quick way of printing out what all my variables even are, though. I know this can be done with C-style recursion, but I'm hoping for a more elegant/simpler solution.
You can use lapply:
lapply(df, plyr::count)
Alternatively, keeping everything in base R you can use table with stack to get similar output
lapply(df, function(x) stack(table(x)))
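A small base R sketch of the per-column idea on made-up data. It uses as.data.frame(table(...)) instead of stack, which is a swapped-in variant that also yields one small frequency table per column; the data frame df and the column label value are assumptions for the example.

```r
# Hypothetical data frame with two categorical columns.
df <- data.frame(
  colour = c("red", "blue", "red"),
  size   = c("S", "S", "L"),
  stringsAsFactors = FALSE
)

# One frequency table per column, returned as a named list of data frames.
counts <- lapply(df, function(x) as.data.frame(table(value = x)))
```

Here counts$colour has a value column with the distinct entries and a Freq column with how often each occurs.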

Sort a data.frame by multiple columns whose names are contained in a single object?

I want to sort a data.frame by multiple columns, ideally using base R without any external packages (though if necessary, so be it). Having read How to sort a dataframe by column(s)?, I know I can accomplish this with the order() function as long as I either:
Know the explicit names of each of the columns.
Have a separate object representing each individual column by which to sort.
But what if I only have one vector containing multiple column names, of length that's unknown in advance?
Say the vector is called sortnames.
data[order(data[, sortnames]), ] won't work, because order() treats that as a single sorting argument.
data[order(data[, sortnames[1]], data[, sortnames[2]], ...), ] will work if and only if I specify the exact correct number of sortname values, which I won't know in advance.
Things I've looked at but not been totally happy with:
eval(parse(text=paste("data[with(data, order(", paste(sortnames, collapse=","), ")), ]"))). Maybe this is fine, but I've seen plenty of hate for using eval(), so asking for alternatives seemed worthwhile.
I may be able to use the Deducer library to do this with sortData(), but like I said, I'd rather avoid using external packages.
If I'm being too stubborn about not using external packages, let me know. I'll get over it. All ideas appreciated in advance!
You can use do.call:
# Example data; the columns a, b and c are random draws
data <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10))
sortnames <- c("a", "b")
# Each selected column becomes one argument to order()
data[do.call("order", data[sortnames]), ]
This trick is useful when you want to pass multiple arguments to a function and these arguments are in a convenient named list.
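A deterministic sketch of the same do.call approach, with one defensive tweak that is not part of the answer above: the list is passed through unname() first. Since do.call matches list names to argument names, a column that happened to be called decreasing or na.last would otherwise be taken as that argument of order() rather than as a sort key. The data here is made up for illustration.

```r
# Hypothetical data with a fixed, known order.
data <- data.frame(
  g = c("b", "a", "b", "a"),
  v = c(2, 4, 1, 3),
  w = 1:4,
  stringsAsFactors = FALSE
)
sortnames <- c("g", "v")

# Drop the names so every column is passed to order() positionally.
sort_keys <- unname(as.list(data[sortnames]))
sorted <- data[do.call(order, sort_keys), ]
```

The rows come back sorted by g first and by v within each value of g, however many names sortnames contains.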
