This question builds from the SO post found here
I am trying to extract a random sample of rows in a data frame using a nesting condition.
Using the following dummy dataset (modified from iris):
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 5.3 2.9 1.5 0.2 setosa
5 5.2 3.7 1.3 0.2 virginica
6 4.7 3.2 1.5 0.2 virginica
7 3.9 3.1 1.4 0.2 virginica
8 4.7 3.2 1.3 0.2 virginica
9 4.0 3.1 1.5 0.2 versicolor
10 5.0 3.6 1.4 0.2 versicolor
11 4.6 3.1 1.5 0.2 versicolor
12 5.0 3.6 1.5 0.2 versicolor
The code below works fine to take a simple sample of 2 rows:
iris[sample(nrow(iris), 2), ]
However, what I would like to do is to take a sample of 2 rows for each level of a specific variable. For example create a random sample of 2 rows for each level of the variable 'Species', like that:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
4 5.3 2.9 1.5 0.2 setosa
6 4.7 3.2 1.5 0.2 virginica
7 3.9 3.1 1.4 0.2 virginica
11 4.6 3.1 1.5 0.2 versicolor
12 5.0 3.6 1.5 0.2 versicolor
Thanks for your help!
Very easy with dplyr:
library(dplyr)
iris %>%
group_by(Species) %>%
sample_n(size = 2)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 4.6 3.4 1.4 0.3 setosa
# 2 5.2 3.5 1.5 0.2 setosa
# 3 6.5 2.8 4.6 1.5 versicolor
# 4 5.7 2.8 4.5 1.3 versicolor
# 5 5.8 2.8 5.1 2.4 virginica
# 6 7.7 2.6 6.9 2.3 virginica
You can group by as many columns as you'd like
CO2 %>% group_by(Type, Treatment) %>% sample_n(size = 2)
Related
I’d like to dynamically assign which columns to subtract from each other. I’ve read around and looks like I need to use all_of, and maybe across (How to subtract one column from multiple columns in a dataframe in R using dplyr, How to you use objects in dplyr filter?). I can get it working for one variable in a mutate phrase (e.g. mutate(y = all_of(x))), but I can’t seem to do even simple calculations using two. Here’s a simplified example of what I want to do:
var1 <- c("Sepal.Length")
var2 <- c("Sepal.Width")
result <- iris %>%
mutate(calculation = all_of(var1) - all_of(var2))
We may use .data to subset the column as a vector. The all_of/any_of are used along with across to loop across the columns
library(dplyr)
iris %>%
mutate(calculation = .data[[var1]] - .data[[var2]])%>%
head
-output
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
Or may also use cur_data()
iris %>%
head %>%
mutate(calculation = cur_data()[[var1]] - cur_data()[[var2]])
-output
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
Or another option is to pass both the variables in across, and then reduce with -
library(purrr)
iris %>%
head %>%
mutate(calculation = reduce(across(all_of(c(var1, var2))), `-`))
-output
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
Or could convert to symbol and evaluate (!!)
iris %>%
head %>%
mutate(calculation = !! rlang::sym(var1) - !! rlang::sym(var2))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
Or if we want to use all_of in across, just subset the column with [[
iris %>%
head %>%
mutate(calculation = across(all_of(var1))[[1]] -
across(all_of(var2))[[1]])
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
The reason we need to subset is because, across by default will update the original column when the .names is not present. The calculation will be a data.frame with a single column
out <- iris %>%
head %>%
mutate(calculation = across(all_of(var1)) -
across(all_of(var2)))
out
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
str(out)
data.frame': 6 obs. of 6 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1
$ calculation :'data.frame': 6 obs. of 1 variable:
..$ Sepal.Length: num 1.6 1.9 1.5 1.5 1.4 1.5
We could use get to access the variable values where the name of variable is stored in a string (thanks to akrun for assist):
iris %>%
mutate(calculation = get(var1) - get(var2))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
7 4.6 3.4 1.4 0.3 setosa 1.2
8 5 3.4 1.5 0.2 setosa 1.6
9 4.4 2.9 1.4 0.2 setosa 1.5
10 4.9 3.1 1.5 0.1 setosa 1.8
# ... with 140 more rows
This is a simplified version of the actual problem I'm dealing with. In this example, I'll be working with four columns, and the actual problem requires working with about 20-30 columns.
Consider the iris dataset. Suppose that I wanted to, for some reason, append new columns which would be equal to double the .Length and the .Width columns. With the following code, this would change the existing columns:
library(dplyr)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
df_iris <- iris %>% mutate(across(matches("(\\.)(Length|Width)"),
function(x) { x * 2 }))
head(df_iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 10.2 7.0 2.8 0.4 setosa
2 9.8 6.0 2.8 0.4 setosa
3 9.4 6.4 2.6 0.4 setosa
4 9.2 6.2 3.0 0.4 setosa
5 10.0 7.2 2.8 0.4 setosa
6 10.8 7.8 3.4 0.8 setosa
However, instead, I would like to have this doubled calculation create NEW columns, say .Length.2 and .Width.2. One way this could be done is the following:
double <- function(x) {
x * 2
}
df_iris <- iris %>%
mutate(Sepal.Length.2 = double(Sepal.Length),
Sepal.Width.2 = double(Sepal.Width),
Petal.Length.2 = double(Petal.Length),
Petal.Width.2 = double(Petal.Width))
head(df_iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length.2 Sepal.Width.2 Petal.Length.2 Petal.Width.2
1 5.1 3.5 1.4 0.2 setosa 10.2 7.0 2.8 0.4
2 4.9 3.0 1.4 0.2 setosa 9.8 6.0 2.8 0.4
3 4.7 3.2 1.3 0.2 setosa 9.4 6.4 2.6 0.4
4 4.6 3.1 1.5 0.2 setosa 9.2 6.2 3.0 0.4
5 5.0 3.6 1.4 0.2 setosa 10.0 7.2 2.8 0.4
6 5.4 3.9 1.7 0.4 setosa 10.8 7.8 3.4 0.8
Is there a way to do this in dplyr without:
relying on superseded/deprecated functions?
having to manually specify each column name?
We can use across (used dplyr 1.0.6 version)
library(dplyr)
df_iris <- iris %>%
mutate(across(where(is.numeric), double, .names = '{.col}.2'))
-output
head(df_iris, 3)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length.2 Sepal.Width.2 Petal.Length.2 Petal.Width.2
1 5.1 3.5 1.4 0.2 setosa 10.2 7.0 2.8 0.4
2 4.9 3.0 1.4 0.2 setosa 9.8 6.0 2.8 0.4
3 4.7 3.2 1.3 0.2 setosa 9.4 6.4 2.6 0.4
I have a data frame with 81 objects and 12 variables, including an ID for each object.
Further, I have a sorted(!) list of ID's.
Now, I want to sort my data frame after this specific list.
Can anyone make a simple example for that case?
I am a newbie, trying to learn.
Thanks in advance!
Quick example of my case:
ID City NR1 NR2
Dataframe1 = "11000", Berlin, (123,2), (532,1)
"02401", Hamburg, (435,2), (352,1)
"83329", München, (124,3), (125,2)
ID = list("02401", "83329", "11000")
Now, I want Dataframe1 to be sorted after the ID from the list.
You can arrange your dataframe using arrange().
An example:
The iris dataset, as is:
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
creating an external vector:
index<-sample(1:150)
Then you can sort your dataframe with that external vector:
head(arrange(iris, index))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 6.4 2.7 5.3 1.9 virginica
2 5.5 3.5 1.3 0.2 setosa
3 6.3 3.3 6.0 2.5 virginica
4 6.3 3.3 4.7 1.6 versicolor
5 4.9 2.5 4.5 1.7 virginica
6 5.7 2.8 4.5 1.3 versicolor
To arrange by a specific external vector that matches one of the variables, you can use match()
iris2<-head(iris)%>%mutate(ID=sample(1:150, 6))
> iris2
Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID
1 5.1 3.5 1.4 0.2 setosa 29
2 4.9 3.0 1.4 0.2 setosa 61
3 4.7 3.2 1.3 0.2 setosa 69
4 4.6 3.1 1.5 0.2 setosa 89
5 5.0 3.6 1.4 0.2 setosa 59
6 5.4 3.9 1.7 0.4 setosa 84
external_vector<-c(69,59,84,29,61,89)
arrange with match():
iris2[match(external_vector, iris2$ID),]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID
3 4.7 3.2 1.3 0.2 setosa 69
5 5.0 3.6 1.4 0.2 setosa 59
6 5.4 3.9 1.7 0.4 setosa 84
1 5.1 3.5 1.4 0.2 setosa 29
2 4.9 3.0 1.4 0.2 setosa 61
4 4.6 3.1 1.5 0.2 setosa 89
Sample df:
library(tidyverse)
iris <- iris[1:10,]
iris$testlag <- NA
iris[[1,"testlag"]] <- 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species testlag
1 5.1 3.5 1.4 0.2 setosa 5
2 4.9 3.0 1.4 0.2 setosa NA
3 4.7 3.2 1.3 0.2 setosa NA
4 4.6 3.1 1.5 0.2 setosa NA
5 5.0 3.6 1.4 0.2 setosa NA
6 5.4 3.9 1.7 0.4 setosa NA
7 4.6 3.4 1.4 0.3 setosa NA
8 5.0 3.4 1.5 0.2 setosa NA
9 4.4 2.9 1.4 0.2 setosa NA
10 4.9 3.1 1.5 0.1 setosa NA
In the testlag column, I'm interesting in using dplyr::lag() to retrieve the previous value and add some column, for example Petal.Length to it. As I have only one initial value, each subsequent calculation requires it to work iteratively, so I thought something like mutate would work.
I first tried doing something like this:
iris %>% mutate_at("testlag", ~ lag(.) + Petal.Length)
But this removed the first value, and only gave a valid value for the second row and NAs for the rest. Intuitively I know why it's removing the first value, but I thought the nature of mutate would allow it to work for the rest of the values, so I don't know how to fix that.
Of course using base R I could something like:
for (idx in 2:nrow(iris)) {
iris[[idx, "testlag"]] <-
lag(iris$testlag)[idx] + iris[[idx, "Petal.Length"]]
}
But I would prefer to implement this in tidyverse syntax.
Edit: Desired output (from my for loop)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species testlag
1 5.1 3.5 1.4 0.2 setosa 5.0
2 4.9 3.0 1.4 0.2 setosa 6.4
3 4.7 3.2 1.3 0.2 setosa 7.7
4 4.6 3.1 1.5 0.2 setosa 9.2
5 5.0 3.6 1.4 0.2 setosa 10.6
6 5.4 3.9 1.7 0.4 setosa 12.3
7 4.6 3.4 1.4 0.3 setosa 13.7
8 5.0 3.4 1.5 0.2 setosa 15.2
9 4.4 2.9 1.4 0.2 setosa 16.6
10 4.9 3.1 1.5 0.1 setosa 18.1
Does this work for you?
library(tidyverse)
library("data.table")
iris <- iris[1:10,]
iris$testlag <- NA
iris[[1,"testlag"]] <- 5
iris %>% mutate (testlag = lag(first(testlag) + cumsum(Petal.Length)))
Result:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species testlag
1 5.1 3.5 1.4 0.2 setosa NA
2 4.9 3.0 1.4 0.2 setosa 6.4
3 4.7 3.2 1.3 0.2 setosa 7.8
4 4.6 3.1 1.5 0.2 setosa 9.1
5 5.0 3.6 1.4 0.2 setosa 10.6
6 5.4 3.9 1.7 0.4 setosa 12.0
7 4.6 3.4 1.4 0.3 setosa 13.7
8 5.0 3.4 1.5 0.2 setosa 15.1
9 4.4 2.9 1.4 0.2 setosa 16.6
10 4.9 3.1 1.5 0.1 setosa 18.0
Since technically there is no N-1 Petal length when N = 1, I left the first value of testlag NA. Do you really need it to be initial value? If you need, this will work:
iris %>% mutate (testlag = lag(first(testlag) + cumsum(Petal.Length), default=first(testlag)))
The function you're looking for is tidyr::fill
library(tidyverse)
iris <- iris[1:10,]
iris$testlag <- NA
iris[[1,"testlag"]] <- 5
iris %>% fill(testlag, .direction = "down")
# Note the default is 'down', but I included here for completeness
This takes the specified column (testlag in this case), and copies any values in that column to the values below. This also works if you have a value in a subset of the rows: it copies the value down until it reaches a new value, then it picks up with that one.
For example:
library(tidyverse)
iris <- iris[1:10,]
iris$testlag <- NA
iris[[1,"testlag"]] <- 5
iris[[5,"testlag"]] <- 10
Sepal.Length Sepal.Width Petal.Length Petal.Width Species testlag
1 5.1 3.5 1.4 0.2 setosa 5
2 4.9 3.0 1.4 0.2 setosa NA
3 4.7 3.2 1.3 0.2 setosa NA
4 4.6 3.1 1.5 0.2 setosa NA
5 5.0 3.6 1.4 0.2 setosa 10
6 5.4 3.9 1.7 0.4 setosa NA
7 4.6 3.4 1.4 0.3 setosa NA
8 5.0 3.4 1.5 0.2 setosa NA
9 4.4 2.9 1.4 0.2 setosa NA
10 4.9 3.1 1.5 0.1 setosa NA
Applying this function...
iris %>% fill(testlag, .direction = "down")
Gives
Sepal.Length Sepal.Width Petal.Length Petal.Width Species testlag
1 5.1 3.5 1.4 0.2 setosa 5
2 4.9 3.0 1.4 0.2 setosa 5
3 4.7 3.2 1.3 0.2 setosa 5
4 4.6 3.1 1.5 0.2 setosa 5
5 5.0 3.6 1.4 0.2 setosa 10
6 5.4 3.9 1.7 0.4 setosa 10
7 4.6 3.4 1.4 0.3 setosa 10
8 5.0 3.4 1.5 0.2 setosa 10
9 4.4 2.9 1.4 0.2 setosa 10
10 4.9 3.1 1.5 0.1 setosa 10
When splitting a dataframe with by, the 'by' variables are printed, but not retained as variables.
data(iris)
dflist <- by(iris[,1:4], iris[,"Species"], data.frame)
head(dflist[[1]])
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.1 3.5 1.4 0.2
2 4.9 3.0 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
5 5.0 3.6 1.4 0.2
Is it possible to retain the variable as a column var as below?
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
Or is there a better way to group the data by certain variables into a list object?
If you want to keep the sepecies column, then you just have to ask for it. Right now you are explicitly removing it by only selecting columns 1:4.
dflist <- by(iris[,1:5], iris[,"Species"], data.frame)
head(dflist[[1]])
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
or at this point, since you are just splitting the data and not applying a function
dflist <- split(iris, iris[,"Species"])
would work just as well.
split might do what you're looking for:
split(iris, iris$Species)
# $setosa
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# ...
# $versicolor
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 51 7.0 3.2 4.7 1.4 versicolor
# 52 6.4 3.2 4.5 1.5 versicolor
# 53 6.9 3.1 4.9 1.5 versicolor
# 54 5.5 2.3 4.0 1.3 versicolor
# 55 6.5 2.8 4.6 1.5 versicolor
# ...
Is this what you want?
species_list <- split(iris,iris$Species,drop=FALSE)