I can sample 10 rows from a data.frame like this:
mtcars[sample(1:32, 10),]
What is the syntax for doing this with dplyr? This is what I tried:
library(dplyr)
filter(mtcars, sample(1:32, 10))
I believe you aren't really "filtering" in your example, you are just sampling rows.
In Hadley's words, here is the purpose of the function:
filter() works similarly to subset() except that you can give it any number of filtering conditions which are joined together with & (not && which is easy to do accidentally!)
Here is an example with the mtcars dataset, as used in the introductory vignette:
library(dplyr)
filter(mtcars, cyl == 8, wt < 3.5)
mpg cyl disp hp drat wt qsec vs am gear carb
1 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
2 15.2 8 304 150 3.15 3.435 17.30 0 0 3 2
3 15.8 8 351 264 4.22 3.170 14.50 0 1 5 4
In conclusion: filter() is equivalent to subset(), not sample().
Figured out how to do it (although Josh O'Brien beat me to it):
filter(mtcars, rownames(mtcars) %in% sample(rownames(mtcars), 10, replace = FALSE))
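For comparison, here is a minimal sketch using slice_sample(), the dplyr verb for drawing random rows (it requires dplyr 1.0.0 or later). The seed value and the name `sampled` are arbitrary choices for illustration.

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only so the draw is reproducible

# Draw 10 rows at random, without replacement (the default)
sampled <- slice_sample(mtcars, n = 10)
nrow(sampled)
```

Unlike the filter() approach, this does not depend on the row names being unique.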
I am trying to find the range of the variable lat for each of the other columns, which contain occurrence records (e.g. 0, 1, 2, 3), using only the rows where the occurrence is non-zero (i.e. the range of lat where occurrence > 0). I've tried to subset the data for each column, dropping rows with 0 individuals recorded, but I can't get it to work.
I tried to extract the minimum and maximum of lat for each species column where the occurrence was > 0 using which.min/which.max:
allfreq$lat[which.min(allfreq$lat[allfreq$Fem.mad !=0])]
however the results made no sense in that the values were nowhere near the minimum and maximum I observed visually.
Using mtcars dataset
> sapply(mtcars,function(x){range(x[x!=0])})
mpg cyl disp hp drat wt qsec vs am gear carb
[1,] 10.4 4 71.1 52 2.76 1.513 14.5 1 1 3 1
[2,] 33.9 8 472.0 335 4.93 5.424 22.9 1 1 5 8
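The likely reason the which.min() attempt in the question gives odd values is that which.min() returns a position within the filtered vector, which is then used to index the unfiltered allfreq$lat. A minimal sketch of taking the range of lat directly, assuming a small made-up allfreq (the values and the second column name Fem.sp2 are hypothetical):

```r
# Hypothetical data: lat plus two occurrence-count columns
allfreq <- data.frame(
  lat     = c(10, 20, 30, 40, 50),
  Fem.mad = c(0, 3, 0, 1, 0),
  Fem.sp2 = c(2, 0, 0, 0, 5)
)

# Every column except lat is treated as a species column
species_cols <- setdiff(names(allfreq), "lat")

# For each species, the range of lat over rows with a non-zero count
lat_ranges <- sapply(allfreq[species_cols], function(x) range(allfreq$lat[x > 0]))
lat_ranges
```

Row 1 of the result holds the minimum and row 2 the maximum latitude for each species column.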
I want to randomly sample a dataset. If I already have that dataset loaded, I can do something like this:
library(dplyr)
set.seed(-1)
mtcars %>% slice_sample(n = 3)
# mpg cyl disp hp drat wt qsec vs am gear carb
# AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.3 0 0 3 2
# Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.0 1 0 4 2
But my dataset is stored as a parquet file. As an example, I'll create a parquet from mtcars:
library(arrow)
# Create parquet file
write_dataset(mtcars, "~/mtcars", format = "parquet")
open_dataset("~/mtcars") %>%
slice_sample(n = 3) %>%
collect()
# Error in UseMethod("slice_sample") :
# no applicable method for 'slice_sample' applied to an object of class "c('FileSystemDataset', 'Dataset', 'ArrowObject', 'R6')"
Clearly, slice_sample isn't implemented for parquet files and neither is slice:
open_dataset("~/mtcars") %>% nrow() -> n
subsample <- sample(1:n, 3)
open_dataset("~/mtcars") %>%
slice(subsample) %>%
collect()
# Error in UseMethod("slice") :
# no applicable method for 'slice' applied to an object of class "c('FileSystemDataset', 'Dataset', 'ArrowObject', 'R6')"
Now, I know filter is implemented, so I tried that:
open_dataset("~/mtcars") %>%
filter(row_number() %in% subsample) %>%
collect()
# Error: Filter expression not supported for Arrow Datasets: row_number() %in% subsample
# Call collect() first to pull data into R.
(This also doesn't work if I create a filtering vector first, e.g., foo <- rep(FALSE, n); foo[subsample] <- TRUE and use that in filter.)
This error offers some helpful advice, though: collect the data and then subsample. The issue is that the file is ginormous. So much so, that it crashes my session.
Question: is there a way to randomly subsample a parquet file before loading it with collect?
It turns out that there is an example in the documentation that pretty much fulfils my goal. That example is a smidge dated, as it uses sample_frac (which has been superseded) rather than slice_sample, but the general principle holds, so I've updated it here. As I don't know how many batches there will be, I show how it can be done with proportions, as Pace suggested, instead of pulling a fixed number of rows.
One issue with this approach is that (as far as I understand) it still requires the entire dataset to be read in; it just does so in batches rather than in one go.
open_dataset("~/mtcars") %>%
map_batches(~ as_record_batch(slice_sample(as.data.frame(.), prop = 0.1))) %>%
collect()
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# 2 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4
# 3 15.8 8 351 264 4.22 3.170 14.50 0 1 5 4
For example, if I want to keep only those rows of the data mtcars where the variable qsec contains this decimal .50, following the solutions given here, I use:
mtcars_stringed<-mtcars%>%filter(str_detect(qsec, ".50"))
mtcars_stringed<-mtcars[mtcars$qsec %like% ".50", ]
mtcars_stringed <- mtcars[grep(".50", mtcars$qsec), ]
View(mtcars_stringed)
Surprisingly, all these strategies fail and return no rows, even though mtcars$qsec has values containing .50, such as 14.50 and 15.50.
Any alternative solution, or is there something I am missing? Thanks in advance.
When you treat a numeric as a string, it is converted as.character(mtcars$qsec). If you look at that, you'll see that in the conversion, trailing 0s are dropped, so we get, e.g., "14.5", "15.5".
It will work if you use the regex pattern "\\.5$": the \\ escapes the . so that it matches a literal dot rather than "any character", and the $ anchors the match to the end of the string.
mtcars %>% filter(str_detect(qsec, "\\.5$"))
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
# 2 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
However, in general, treating decimals as strings can be risky. A better approach might be to get rid of the integer part with %% 1 and then test for nearness to 0.5 within some tolerance; this avoids precision issues.
mtcars %>% filter(abs(qsec %% 1 - 0.5) < 1e-10)
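To make the conversion issue concrete, here is a small sketch on a hand-picked vector (the values are for illustration only). It shows the dropped trailing zero, the tolerance test, and, as an alternative not mentioned above, formatting to two decimals with sprintf() before matching.

```r
qsec <- c(14.50, 15.50, 16.46, 17.02)

# Default conversion drops the trailing zero, so ".50" can never match
as_text <- as.character(qsec)
as_text[1:2]

# Numeric test: fractional part within a tolerance of 0.5
by_tolerance <- qsec[abs(qsec %% 1 - 0.5) < 1e-10]

# String test: format to exactly two decimals first, then match ".50" at the end
by_format <- qsec[grepl("\\.50$", sprintf("%.2f", qsec))]
```

Both `by_tolerance` and `by_format` keep the same two values here, 14.5 and 15.5.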
You are probably looking for:
mtcars %>%
filter(qsec %% 0.50 == 0 & qsec %% 1 != 0)
mpg cyl disp hp drat wt qsec vs am gear carb
1 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
2 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
I can find the maximum values of the columns disp and hp in the mtcars dataset using sapply, which gives 472 and 335 respectively:
sapply(list(mtcars$disp,mtcars$hp), max, na.rm=TRUE)
Now I want cyl for these values, i.e. the cyl of the cars where the maximum values from sapply(list(mtcars$disp,mtcars$hp),max,na.rm=TRUE) are found.
Which function should I be using? I tried unsuccessfully with which,rownames,colnames:
mtcars(which(sapply(list(mtcars$disp,mtcars$hp),max,na.rm=TRUE)))
rownames(which(sapply(list(mtcars$disp,mtcars$hp),max,na.rm=TRUE))))
mtcars$cyl(sapply(list(mtcars$disp,mtcars$hp),max,na.rm=TRUE))
library(dplyr)
filter(mtcars, hp==max(hp) | disp == max(disp))$cyl
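A base R sketch of the same lookup, closer to the sapply() call in the question: which.max() returns the row position of the maximum, which can then index cyl. The name `cyl_at_max` is just for illustration. Note that which.max() returns only the first position if there are ties.

```r
# For each column of interest, look up cyl in the row holding that column's maximum
cyl_at_max <- sapply(mtcars[c("disp", "hp")], function(x) mtcars$cyl[which.max(x)])
cyl_at_max
```

The result is a named vector with one cyl value per column, so it also shows which maximum each value belongs to.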
And the data.table solution is:
require(data.table)
mtcars <- as.data.table(mtcars)
mtcars[hp==max(hp) | disp==max(disp)]
mpg cyl disp hp drat wt qsec vs am gear carb
1: 10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
2: 15.0 8 301 335 3.54 3.57 14.60 0 1 5 8
# if you want to get one column, e.g. 'cyl'
mtcars[hp==max(hp) | disp == max(disp), cyl]
[1] 8 8
# if you want to get several columns, do either of:
mtcars[hp==max(hp) | disp == max(disp), .(cyl,qsec)]
mtcars[hp==max(hp) | disp == max(disp), list(cyl,qsec)]
cyl qsec
1: 8 17.98
2: 8 14.60
I am a complete beginner in R, so this question may be a basic one.
I am trying to convert data in matrix format to panel data format when A, B or C = 0. For example:
set.seed(0); mat <- matrix(sample(0:1, 16, replace=T), ncol=4, nrow=4)
colnames (mat) <- c("A", "B", "C", "D")
rownames (mat) <- c("1","2", "3", "4")
to a panel format like:
A 1
A 2
A 3
A 4
B 1
B 2
B 3
B 4
for every letter where variable "1"-"4" are 0.
I tried using the apply codes from the plyr package. Can someone provide me the right code and argument for letting R know that it should extract A, B, C or D if "1"=0 and repeat the same process for "2", "3" and "4" and that R puts the output underneath the former in a new dataframe?
I realized the question above is not clear enough, so I will clarify it using the mtcars dataset.
cars <- mtcars
In case of this dataset, the format I would like is:
Mazda RX4 | mpg | 21.0
Mazda RX4 | cyl | 6
Mazda RX4 | disp | 160.0
...
Mazda RX4 Wag | mpg | 21.0
Mazda RX4 Wag | cyl | 6
...
and so on.
A note: you keep referring to the rows as variables. Having your variables in rows is at the very least confusing, if not outright dangerous, because people expect variables to be in columns!
If your variables are called "1",...,"4", then I assume A,...,D refer to your observations? This would be even more confusing...
If you are interested in what makes data tidy, you should read Hadley Wickham's revealing article on tidy data.
EDIT:
Regarding your question:
Using the mtcars dataset and functions from the tidyr and dplyr package:
require(tidyr)
require(dplyr)
mtcars %>%
add_rownames() %>%
gather("id", "value", mpg:carb) %>%
arrange(rowname)
Source: local data frame [352 x 3]
rowname id value
(chr) (chr) (dbl)
1 AMC Javelin mpg 15.200
2 AMC Javelin cyl 8.000
3 AMC Javelin disp 304.000
4 AMC Javelin hp 150.000
5 AMC Javelin drat 3.150
6 AMC Javelin wt 3.435
7 AMC Javelin qsec 17.300
8 AMC Javelin vs 0.000
9 AMC Javelin am 0.000
10 AMC Javelin gear 3.000
.. ... ... ...
If you don't know the %>% operator (called the pipe operator), just read it as "and then".
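In current versions of these packages, add_rownames() is deprecated and gather() is superseded. A sketch of the same reshaping with tibble::rownames_to_column() and tidyr::pivot_longer(), keeping the column names used above (the name `long_cars` is only for illustration):

```r
library(dplyr)
library(tidyr)
library(tibble)

long_cars <- mtcars %>%
  rownames_to_column("rowname") %>%                              # row names become a column
  pivot_longer(-rowname, names_to = "id", values_to = "value") %>%  # one row per car and variable
  arrange(rowname)

head(long_cars)
```

This should again give 352 rows (32 cars times 11 variables) in three columns.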
For the mtcars example, this piece of code
library(data.table)
cars <- as.data.table(mtcars, keep.rownames = TRUE)
melt(cars, id.vars = "rn")[order(rn)]
will give
rn variable value
1: AMC Javelin mpg 15.20
2: AMC Javelin cyl 8.00
3: AMC Javelin disp 304.00
4: AMC Javelin hp 150.00
5: AMC Javelin drat 3.15
---
348: Volvo 142E qsec 18.60
349: Volvo 142E vs 1.00
350: Volvo 142E am 1.00
351: Volvo 142E gear 4.00
352: Volvo 142E carb 2.00
Note that mtcars is a data.frame not a matrix.
The solution for the matrix mat given in the question is
melt(as.data.table(mat, keep.rownames = TRUE), id.vars = "rn")[value == 0][
order(variable, rn), .(variable, rn)]
which will return
variable rn
1: A 2
2: A 3
3: B 2
4: C 3
5: C 4
6: D 1
7: D 3
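For completeness, a base R sketch of the same idea using which(..., arr.ind = TRUE). Because sample() can give different draws across R versions, the matrix here is written out by hand, with its zeros placed to match the output above; `idx` and `panel` are names made up for this example.

```r
# Fixed 4 x 4 matrix, filled column by column (A, B, C, D)
mat <- matrix(c(1, 0, 0, 1,
                1, 0, 1, 1,
                1, 1, 0, 0,
                0, 1, 0, 1),
              nrow = 4,
              dimnames = list(as.character(1:4), c("A", "B", "C", "D")))

# Row and column positions of every zero, in column-major order
idx <- which(mat == 0, arr.ind = TRUE)

# Translate the positions back into column and row names
panel <- data.frame(
  variable = colnames(mat)[idx[, "col"]],
  rn       = rownames(mat)[idx[, "row"]],
  stringsAsFactors = FALSE
)
panel
```

Since which() walks the matrix column by column, the result is already ordered by variable and then by row name, so no extra sorting step is needed.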