Random sampling of parquet prior to collect

I want to randomly sample a dataset. If I already have that dataset loaded, I can do something like this:
library(dplyr)
set.seed(-1)
mtcars %>% slice_sample(n = 3)
# mpg cyl disp hp drat wt qsec vs am gear carb
# AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.3 0 0 3 2
# Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.0 1 0 4 2
But my dataset is stored as a parquet file. As an example, I'll create a parquet from mtcars:
library(arrow)
# Create parquet file
write_dataset(mtcars, "~/mtcars", format = "parquet")
open_dataset("~/mtcars") %>%
slice_sample(n = 3) %>%
collect()
# Error in UseMethod("slice_sample") :
# no applicable method for 'slice_sample' applied to an object of class "c('FileSystemDataset', 'Dataset', 'ArrowObject', 'R6')"
Clearly, slice_sample isn't implemented for Arrow datasets, and neither is slice:
open_dataset("~/mtcars") %>% nrow() -> n
subsample <- sample(1:n, 3)
open_dataset("~/mtcars") %>%
slice(subsample) %>%
collect()
# Error in UseMethod("slice") :
# no applicable method for 'slice' applied to an object of class "c('FileSystemDataset', 'Dataset', 'ArrowObject', 'R6')"
Now, I know filter is implemented, so I tried that:
open_dataset("~/mtcars") %>%
filter(row_number() %in% subsample) %>%
collect()
# Error: Filter expression not supported for Arrow Datasets: row_number() %in% subsample
# Call collect() first to pull data into R.
(This also doesn't work if I create a filtering vector first, e.g., foo <- rep(FALSE, n); foo[subsample] <- TRUE and use that in filter.)
This error offers some helpful advice, though: collect the data and then subsample. The issue is that the file is ginormous; so much so that it crashes my session.
Question: is there a way to randomly subsample a parquet file before loading it with collect?

It turns out that there is an example in the documentation that pretty much fulfils my goal. That example is a smidge dated, as it uses the superseded sample_frac rather than slice_sample, but the general principle holds, so I've updated it here. As I don't know how many batches there will be, I show how it can be done with a proportion, like Pace suggested, instead of pulling a fixed number of rows.
One issue with this approach is that (as far as I understand) it still requires the entire dataset to be read in; it just does so in batches rather than in one go.
open_dataset("~/mtcars") %>%
map_batches(~ as_record_batch(slice_sample(as.data.frame(.), prop = 0.1))) %>%
collect()
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# 2 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4
# 3 15.8 8 351 264 4.22 3.170 14.50 0 1 5 4
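If an (approximately) fixed number of rows is wanted instead, one option is to derive the proportion from the dataset's row count first. A sketch of that idea (the n_target name is my own, and because sampling happens per batch the total is only approximate):
library(arrow)
library(dplyr)
ds <- open_dataset("~/mtcars")
n_target <- 3
sample_prop <- n_target / nrow(ds)  # nrow() works on a Dataset without collecting
ds %>%
  map_batches(~ as_record_batch(slice_sample(as.data.frame(.), prop = sample_prop))) %>%
  collect()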

Related

%>% .$column_name equivalent for R base pipe |>

I frequently use dplyr piping to get a column from a tibble into a vector, as below:
iris %>% .$Sepal.Length
iris %>% .$Sepal.Length %>% cut(5)
How can I do the same using the latest R built-in pipe symbol |>?
iris |> .$Sepal.Length
iris |> .$Sepal.Length |> cut(5)
Error: function '$' not supported in RHS call of a pipe
We can use getElement().
iris |> getElement('Sepal.Length') |> cut(5)
In the base pipe, no placeholder is provided for the data that is passed through the pipe; this is one difference between the magrittr pipe and the base R pipe. You may use an anonymous function to access the object.
iris |> {\(x) x$Sepal.Length}()
The direct usage of $ in |> is currently disabled. If you still need to call $ or another disabled function in |>, an option besides creating a function is to call it via :: as base::`$`, or to wrap it in parentheses, (`$`):
iris |> (`$`)("Sepal.Length")
iris |> base::`$`("Sepal.Length")
iris |> (\(.) .$Sepal.Length)()
fun <- `$`
iris |> fun(Sepal.Length)
This will also work in cases where more than one column is extracted.
iris |> (`[`)(c("Sepal.Length", "Petal.Length"))
Another option is the bizarro pipe ->.;. Some call it a joke, others a clever use of existing syntax.
iris ->.; .$Sepal.Length
This creates or overwrites . in the .GlobalEnv. rm(.) can be used to remove it. Alternatively it could be processed in local:
local({iris ->.; .$Sepal.Length})
In this case it produces two objects in the environment, iris and ., but as long as they are not modified they point to the same address.
tracemem(iris)
#[1] "<0x556871bab148>"
tracemem(.)
#[1] "<0x556871bab148>"
|> is used as a pipe operator in R.
The left-hand side expression lhs is inserted as the first free argument in the call to the right-hand side expression rhs.
mtcars |> head() # same as head(mtcars)
# mpg cyl disp hp drat wt qsec vs am gear carb
#Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
mtcars |> head(2) # same as head(mtcars, 2)
# mpg cyl disp hp drat wt qsec vs am gear carb
#Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
#Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
It is also possible to use a named argument with the placeholder _ in the rhs call to specify where the lhs is to be inserted. The placeholder can only appear once on the rhs.
mtcars |> lm(mpg ~ disp, data = _)
#mtcars |> lm(mpg ~ disp, _) #Error: pipe placeholder can only be used as a named argument
#Call:
#lm(formula = mpg ~ disp, data = mtcars)
#
#Coefficients:
#(Intercept) disp
# 29.59985 -0.04122
Alternatively, explicitly name the other argument(s) so that the piped object fills the remaining one:
mtcars |> lm(formula = mpg ~ disp)
If the placeholder is needed more than once, in an arbitrary named or unnamed position, or with a disabled function, use an (anonymous) function.
mtcars |> (\(.) .[.$cyl == 6,])()
#mtcars ->.; .[.$cyl == 6,] # Alternative using bizarro pipe
#local(mtcars ->.; .[.$cyl == 6,]) # Without overwriting and keeping .
# mpg cyl disp hp drat wt qsec vs am gear carb
#Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
#Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
#Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
mtcars |> (\(.) lm(mpg ~ disp, .))()
#Call:
#lm(formula = mpg ~ disp, data = .)
#
#Coefficients:
#(Intercept) disp
# 29.59985 -0.04122
An expression written as x |> f(y) is parsed as f(x, y). While the code in a pipeline is written sequentially, regular R semantics for evaluation apply. So piped expressions will be evaluated only when first used in the rhs expression.
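Both points can be illustrated with a small toy example of my own: quote() shows how the pipe is rewritten at parse time, and a right-hand side function that never touches its first argument shows that the lhs is not evaluated eagerly.
quote(mtcars |> head(2))
#head(mtcars, 2)
f <- function(x) "x was never needed"
stop("boom") |> f()
#[1] "x was never needed"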
Interesting example and great answers; let me add another version. I usually use select and then unlist in such cases. This follows the "speaking R" paradigm and works the same with both operators, %>% and |>:
library("dplyr")
iris %>% select(Sepal.Length) %>% unlist() %>% cut(5)
iris |> select(Sepal.Length) |> unlist() |> cut(5)
Note that select is from dplyr, and pull, brought in by @jpdugo17, is even better.
If we use the usual base R indexing, it is also short and works in both worlds:
iris[["Sepal.Length"]] |> cut(5)
iris$Sepal.Length |> cut(5)
and thanks to the comment of @zx8754, one can of course also use base R without any pipes:
cut(iris$Sepal.Length, 5)
... but I think that the OP just wanted to point out differences in piping. I guess that it is to be applied in a bigger context and iris is only an example.
This is also an option:
iris |> dplyr::pull(Sepal.Length) |> cut(5)
Edit:
I wonder why calling a function with backticks isn't allowed.
iris |> `[`(, 'Sepal.Length')
#>Error: function '[' not supported in RHS call of a pipe
As pointed out by @Hugh, backticks are allowed but some functions are not.
Here's the list of blacklisted functions, extracted from the wch GitHub mirror of the R sources:
"if", "while", "repeat", "for", "break", "next", "return", "function",
"(", "{",
"+", "-", "*", "/", "^", "%%", "%/%", "%*%", ":", "::", ":::", "?", "|>",
"~", "#", "=>",
"==", "!=", "<", ">", "<=", ">=",
"&", "|", "&&", "||", "!",
"<-", "<<-", "=",
"$", "[", "[[",
"$<-", "[<-", "[[<-",
I know this question is closed. Other base R solutions, which use the symbol name instead of the character name, include:
iris |>
with(Sepal.Length)
iris |>
subset(select = Sepal.Length)
Since R 4.2.0, you can use _ as a placeholder for |>. Because "functions in rhs calls [can] not be syntactically special", you cannot use $ directly, so you have to define the function with another name first, and then use the placeholder and the column name:
set <- `$`
iris |> set(x = _, Sepal.Length)
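(A later release relaxes this further: since R 4.3.0 the placeholder is also allowed as the head of an extraction call such as _$, so, assuming a new enough R, the $ form works directly.)
iris |> _$Sepal.Length |> cut(5)  # requires R >= 4.3.0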

Selecting rows with partial matching where a column has a string not working for decimals

For example, if I want to keep only those rows of mtcars where the variable qsec contains the decimal .50, following the solutions given here, I use:
library(dplyr); library(stringr); library(data.table)
mtcars_stringed <- mtcars %>% filter(str_detect(qsec, ".50"))
mtcars_stringed <- mtcars[mtcars$qsec %like% ".50", ]
mtcars_stringed <- mtcars[grep(".50", mtcars$qsec), ]
View(mtcars_stringed)
Surprisingly, all these strategies fail by returning nothing, while in fact mtcars$qsec has values containing .50, such as 14.50 and 15.50.
Any alternative solution, or is there something I am missing? Thanks in advance.
When you treat a numeric as a string, it is converted with as.character(mtcars$qsec). If you look at that, you'll see that the conversion drops trailing 0s, so we get, e.g., "14.5" and "15.5".
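For example:
as.character(c(14.50, 15.50, 17.02))
#[1] "14.5"  "15.5"  "17.02"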
It will work if you use the regex pattern "\\.5$": the \\ makes the . a literal dot rather than "any character", and the $ anchors the match at the end of the string.
mtcars %>% filter(str_detect(qsec, "\\.5$"))
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
# 2 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
However, in general, treating decimals as strings can be risky. A better approach might be to strip the integer part with %% 1 and then test for nearness to 0.5 within some tolerance; this avoids floating-point precision issues.
mtcars %>% filter(abs(qsec %% 1 - 0.5) < 1e-10)
You are probably looking for:
mtcars %>%
filter(qsec %% 0.50 == 0 & qsec %% 1 != 0)
mpg cyl disp hp drat wt qsec vs am gear carb
1 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
2 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6

How to convert from matrix-like data to panel format

I am a very beginner in working with R. This question therefore can be considered as a basic one.
I am trying to convert data in matrix format to panel data format when A, B, or C = 0. For example:
set.seed(0); mat <- matrix(sample(0:1, 16, replace=T), ncol=4, nrow=4)
colnames (mat) <- c("A", "B", "C", "D")
rownames (mat) <- c("1","2", "3", "4")
to a panel format like:
A 1
A 2
A 3
A 4
B 1
B 2
B 3
B 4
for every letter where the variables "1" to "4" are 0.
I tried the apply functions from the plyr package. Can someone provide the right code and arguments to tell R that it should extract A, B, C, or D if "1" = 0, repeat the same process for "2", "3", and "4", and stack the output underneath the previous results in a new data frame?
I realized the question stated above is not clear enough, so I will make it clearer using the mtcars dataset.
cars <- mtcars
In case of this dataset, the format I would like is:
Mazda RX4 | mpg | 21.0
Mazda RX4 | cyl | 6
Mazda RX4 | disp | 160.0
...
Mazda RX4 Wag | mpg | 21.0
Mazda RX4 Wag | cyl | 6
...
and so on.
A note: you keep referring to the rows as variables. Having your variables in a row is at the very least confusing, if not outright dangerous, because people expect variables to be in a column!
If your variables are called "1", ..., "4", then I assume A, ..., D refer to your observations? That would be even more confusing...
If you are interested in what makes data tidy, you should read Hadley Wickham's revealing article on tidy data.
EDIT:
Regarding your question:
Using the mtcars dataset and functions from the tidyr and dplyr package:
require(tidyr)
require(dplyr)
mtcars %>%
add_rownames() %>%
gather("id", "value", mpg:carb) %>%
arrange(rowname)
Source: local data frame [352 x 3]
rowname id value
(chr) (chr) (dbl)
1 AMC Javelin mpg 15.200
2 AMC Javelin cyl 8.000
3 AMC Javelin disp 304.000
4 AMC Javelin hp 150.000
5 AMC Javelin drat 3.150
6 AMC Javelin wt 3.435
7 AMC Javelin qsec 17.300
8 AMC Javelin vs 0.000
9 AMC Javelin am 0.000
10 AMC Javelin gear 3.000
.. ... ... ...
If you don't know the %>% operator (called the pipe operator), just read it as "and then".
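(add_rownames() is deprecated and gather() has been superseded; a sketch of the same reshape with current tidyr/dplyr verbs, assuming the tibble package is available:)
library(dplyr)
library(tidyr)
library(tibble)
mtcars %>%
  rownames_to_column("rowname") %>%
  pivot_longer(mpg:carb, names_to = "id", values_to = "value") %>%
  arrange(rowname)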
For the mtcars example, this piece of code
library(data.table)
cars <- as.data.table(mtcars, keep.rownames = TRUE)
melt(cars, id.vars = "rn")[order(rn)]
will give
rn variable value
1: AMC Javelin mpg 15.20
2: AMC Javelin cyl 8.00
3: AMC Javelin disp 304.00
4: AMC Javelin hp 150.00
5: AMC Javelin drat 3.15
---
348: Volvo 142E qsec 18.60
349: Volvo 142E vs 1.00
350: Volvo 142E am 1.00
351: Volvo 142E gear 4.00
352: Volvo 142E carb 2.00
Note that mtcars is a data.frame not a matrix.
The solution for the matrix mat given in the Q is
melt(as.data.table(mat, keep.rownames = TRUE), id.vars = "rn")[value == 0][
order(variable, rn), .(variable, rn)]
which will return
variable rn
1: A 2
2: A 3
3: B 2
4: C 3
5: C 4
6: D 1
7: D 3

Using 'mutate_' to sum a bunch of columns row-wise

In this blog post, Paul Hiemstra shows how to sum up two columns using dplyr::mutate_. Copy-pasting the relevant parts:
library(lazyeval)
library(dplyr)
f = function(col1, col2, new_col_name) {
  mutate_call = lazyeval::interp(~ a + b, a = as.name(col1), b = as.name(col2))
  mtcars %>% mutate_(.dots = setNames(list(mutate_call), new_col_name))
}
allows one to then do:
head(f('wt', 'mpg', 'hahaaa'))
Great!
I followed up with a question (see comments) as to how one could extend this to 100 columns, since it wasn't quite clear (to me) how one could do it without having to type all the names using the above method. Paul was kind enough to indulge me and provided this answer (thanks!):
# data
df = data.frame(matrix(1:100, 10, 10))
names(df) = LETTERS[1:10]
# answer
sum_all_rows = function(list_of_cols) {
  summarise_calls = sapply(list_of_cols, function(col) {
    lazyeval::interp(~col_name, col_name = as.name(col))
  })
  df %>% select_(.dots = summarise_calls) %>% mutate(ans1 = rowSums(.))
}
sum_all_rows(LETTERS[sample(1:10, 5)])
I'd like to improve this answer on these points:
The other columns are gone. I'd like to keep them.
It uses rowSums() which has to coerce the data.frame to a matrix which I'd like to avoid.
Also, I'm not sure whether the use of . within non-do() verbs is encouraged, because . within mutate() doesn't seem to refer to just the group's rows when used with group_by().
And most importantly, how can I do the same using mutate_() instead of mutate()?
I found this answer, which addresses point 1, but unfortunately, both dplyr answers use rowSums() along with mutate().
PS: I just read Hadley's comment under that answer. IIUC, "reshape to long form + group by + sum + reshape to wide form" is the recommended dplyr way for these types of operations?
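For reference, a rough sketch of that long-form recipe with current tidyr verbs (the cols selection and the row_id helper are illustrative choices of mine, not from Hadley's comment):
library(dplyr)
library(tidyr)
df <- data.frame(matrix(1:100, 10, 10))
names(df) <- LETTERS[1:10]
cols <- c("A", "C", "E")
df %>%
  mutate(row_id = row_number()) %>%       # remember the original row
  pivot_longer(all_of(cols)) %>%          # long form: one row per (row, column) pair
  group_by(row_id) %>%
  mutate(row_sum = sum(value)) %>%        # sum within each original row
  pivot_wider() %>%                       # back to wide form; the other columns are kept
  ungroup() %>%
  select(-row_id)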
Here's a different approach:
library(dplyr); library(lazyeval)
f <- function(df, list_of_cols, new_col) {
  df %>%
    mutate_(.dots = ~Reduce(`+`, .[list_of_cols])) %>%
    setNames(c(names(df), new_col))
}
head(f(mtcars, c("mpg", "cyl"), "x"))
# mpg cyl disp hp drat wt qsec vs am gear carb x
#1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 27.0
#2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 27.0
#3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 26.8
#4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 27.4
#5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 26.7
#6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 24.1
Regarding your points:
Other columns are kept
It doesn't use rowSums
You are specifically asking for a row-wise operation here so I'm not sure (yet) how a group_by could do any harm when using . inside mutate/mutate_
It makes use of mutate_ (see the side note after this list)
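A side note, since mutate_() has been deprecated in the meantime: the same Reduce() idea translates to current dplyr using pick() and a glue-style column name (a sketch, assuming dplyr >= 1.1.0):
library(dplyr)
f2 <- function(df, list_of_cols, new_col) {
  df %>% mutate("{new_col}" := Reduce(`+`, pick(all_of(list_of_cols))))
}
head(f2(mtcars, c("mpg", "cyl"), "x"))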

Sample a data frame with dplyr [duplicate]

This question already has answers here:
sample rows of subgroups from dataframe with dplyr
(4 answers)
Closed 9 years ago.
I can sample 10 rows from a data.frame like this:
mtcars[sample(1:32, 10),]
What is syntax for doing this with dplyr? This is what I tried:
library(dplyr)
filter(mtcars, sample(1:32, 10))
I believe you aren't really "filtering" in your example; you are just sampling rows.
In Hadley's words, here is the purpose of the function:
filter() works similarly to subset() except that you can give it any number of filtering conditions, which are joined together with & (not && which is easy to do accidentally!)
Here is an example with the mtcars dataset, as it's used in the introductory vignette
library(dplyr)
filter(mtcars, cyl == 8, wt < 3.5)
mpg cyl disp hp drat wt qsec vs am gear carb
1 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
2 15.2 8 304 150 3.15 3.435 17.30 0 0 3 2
3 15.8 8 351 264 4.22 3.170 14.50 0 1 5 4
As a conclusion: filter is equivalent to subset(), not sample().
Figured out how to do it (although Josh O'Brien beat me to it):
filter(mtcars, rownames(mtcars) %in% sample(rownames(mtcars), 10, replace = F))
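(For completeness: dplyr also gained dedicated sampling verbs, so with a current version the sampling can be written directly; sample_n() has since been superseded by slice_sample().)
library(dplyr)
sample_n(mtcars, 10)           # older verb, still works
slice_sample(mtcars, n = 10)   # current equivalent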
