For this example, I'm going to use the iris dataset built into R.
How can I get the same output as the syntax below without copying and pasting it for every group of columns?
package
library(dplyr)
Input
head(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.5 1.4 0.2 setosa
#2 4.9 3.0 1.4 0.2 setosa
#3 4.7 3.2 1.3 0.2 setosa
#4 4.6 3.1 1.5 0.2 setosa
#5 5.0 3.6 1.4 0.2 setosa
#6 5.4 3.9 1.7 0.4 setosa
Manual Solution
I have to subset my dataset based on the column names.
I know how to do this "manually", but it would require a lot of copying and pasting on my current dataset.
Sepal <- iris %>% select(contains("Sepal"))
Petal <- iris %>% select(contains("Petal"))
Output
head(Sepal)
# Sepal.Length Sepal.Width
# 1 5.1 3.5
# 2 4.9 3.0
# 3 4.7 3.2
# 4 4.6 3.1
# 5 5.0 3.6
# 6 5.4 3.9
head(Petal)
# Petal.Length Petal.Width
# 1 1.4 0.2
# 2 1.4 0.2
# 3 1.3 0.2
# 4 1.5 0.2
# 5 1.4 0.2
# 6 1.7 0.4
How can I automate this process? I think I can use the purrr package here, but I couldn't find a way to do it.
You can use
library(tidyverse)
map(set_names(c("Sepal", "Petal")), ~ select(iris, starts_with(.x)))
output (head)
$Sepal
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
$Petal
Petal.Length Petal.Width
1 1.4 0.2
2 1.4 0.2
3 1.3 0.2
4 1.5 0.2
5 1.4 0.2
6 1.7 0.4
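If the list of prefixes is itself long, it could be derived from the column names rather than typed out. This is only a sketch, assuming every numeric column follows the Prefix.Suffix naming used in iris; the object names prefixes and subsets are made up for illustration.

```r
library(dplyr)
library(purrr)
library(stringr)

# Collect the text before the first "." in each numeric column name
prefixes <- datasets::iris %>%
  select(where(is.numeric)) %>%
  names() %>%
  str_remove("\\..*") %>%
  unique()

# One data frame per prefix, stored in a named list
subsets <- map(set_names(prefixes), ~ select(datasets::iris, starts_with(.x)))

names(subsets)
```

Each subset is then available as subsets$Sepal, subsets$Petal, and so on.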
Another option is to use split.default on the prefix of the column names, which returns a named list of data.frames:
library(dplyr)
library(stringr)
head(iris) %>%
select(-Species) %>%
split.default(str_remove(names(.), "\\..*"))
$Petal
Petal.Length Petal.Width
1 1.4 0.2
2 1.4 0.2
3 1.3 0.2
4 1.5 0.2
5 1.4 0.2
6 1.7 0.4
$Sepal
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
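If separate objects named Sepal and Petal are wanted (as in the manual solution), the named list can probably be unpacked with base R's list2env. A rough sketch using only base functions; num_cols and parts are illustrative names, and it assumes that creating objects in the current environment is acceptable.

```r
# Numeric columns only, split by the text before the first "."
num_cols <- datasets::iris[names(datasets::iris) != "Species"]
parts <- split.default(num_cols, sub("\\..*", "", names(num_cols)))

# Turn each list element into an object of the same name (Petal, Sepal)
invisible(list2env(parts, envir = environment()))

head(Sepal)
```

Keeping the list is usually easier to work with than many loose objects, so this step is optional.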
Related
I have a dataframe and I want to create a subset, <Frame>, of just the species variable and display the first five records. How can I subset this in R?
There are 10 rows and 7 columns. One column is Species.
netID - fishID - species - tl - wtag - scale
By using select. Note that head shows six rows by default, so ask for five explicitly:
head(
select(dataframe, species),
5
)
Assuming your dataframe is called df you can subset with dplyr
library(dplyr)
df <- iris[1:10,]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
newdf <- df %>% select(Species) %>% slice(1:5)
Here you select Species from your data frame, and then slice picks the range of rows you need. The output of newdf is
Species
1 setosa
2 setosa
3 setosa
4 setosa
5 setosa
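For comparison, the same subset can be taken without dplyr. A minimal base R sketch, assuming the same df built from the first ten rows of iris:

```r
df <- datasets::iris[1:10, ]

# Single-bracket indexing with a column name keeps the result a data frame
newdf <- head(df["Species"], 5)
newdf
```

Using df$Species or df[["Species"]] instead would return a plain factor vector rather than a one-column data frame.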
I’d like to dynamically assign which columns to subtract from each other. I’ve read around and it looks like I need to use all_of, and maybe across (How to subtract one column from multiple columns in a dataframe in R using dplyr, How to you use objects in dplyr filter?). I can get it working for one variable in a mutate call (e.g. mutate(y = all_of(x))), but I can’t seem to do even simple calculations using two. Here’s a simplified example of what I want to do:
var1 <- c("Sepal.Length")
var2 <- c("Sepal.Width")
result <- iris %>%
mutate(calculation = all_of(var1) - all_of(var2))
We may use the .data pronoun to extract each column as a vector. all_of/any_of are selection helpers; they are meant to be used inside across (or select) to loop across columns, not to pull out a single vector in mutate.
library(dplyr)
iris %>%
mutate(calculation = .data[[var1]] - .data[[var2]])%>%
head
-output
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
Or we may also use cur_data() (deprecated as of dplyr 1.1.0 in favor of pick())
iris %>%
head %>%
mutate(calculation = cur_data()[[var1]] - cur_data()[[var2]])
-output
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
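In dplyr 1.1.0 and later, cur_data() is deprecated and pick() is the suggested replacement. A sketch of the same calculation with pick(), assuming dplyr >= 1.1.0 is installed; result is just an illustrative name.

```r
library(dplyr)

var1 <- "Sepal.Length"
var2 <- "Sepal.Width"

result <- datasets::iris %>%
  head() %>%
  # pick() returns a one-column data frame here; [[1]] extracts the vector
  mutate(calculation = pick(all_of(var1))[[1]] - pick(all_of(var2))[[1]])

result
```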
Or another option is to pass both the variables in across, and then reduce with -
library(purrr)
iris %>%
head %>%
mutate(calculation = reduce(across(all_of(c(var1, var2))), `-`))
-output
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
Or we could convert the strings to symbols and unquote them (!!)
iris %>%
head %>%
mutate(calculation = !! rlang::sym(var1) - !! rlang::sym(var2))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
Or if we want to use all_of in across, just subset the column with [[
iris %>%
head %>%
mutate(calculation = across(all_of(var1))[[1]] -
across(all_of(var2))[[1]])
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
We need to subset with [[ because across returns a data frame rather than a vector, and its columns keep their original names when .names is not supplied. Without the subsetting, calculation would itself be a data.frame with a single column:
out <- iris %>%
head %>%
mutate(calculation = across(all_of(var1)) -
across(all_of(var2)))
out
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
str(out)
'data.frame': 6 obs. of 6 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1
$ calculation :'data.frame': 6 obs. of 1 variable:
..$ Sepal.Length: num 1.6 1.9 1.5 1.5 1.4 1.5
We could use get to access a variable's values when the name of the variable is stored in a string (thanks to akrun for the assist):
iris %>%
mutate(calculation = get(var1) - get(var2))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
7 4.6 3.4 1.4 0.3 setosa 1.2
8 5 3.4 1.5 0.2 setosa 1.6
9 4.4 2.9 1.4 0.2 setosa 1.5
10 4.9 3.1 1.5 0.1 setosa 1.8
# ... with 140 more rows
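One caveat with get is that it can also find objects outside the data frame if no column has that name, whereas .data[[...]] only looks at columns. If the calculation is needed in several places, it might be wrapped in a small function. The helper below is hypothetical (subtract_cols and its arguments are invented for this sketch) and assumes dplyr >= 1.0.0 for the "{out}" := naming syntax.

```r
library(dplyr)

# Hypothetical helper: subtract column `b` from column `a`,
# where `a`, `b` and the output name `out` are given as strings
subtract_cols <- function(data, a, b, out = "calculation") {
  data %>%
    mutate("{out}" := .data[[a]] - .data[[b]])
}

res <- subtract_cols(head(datasets::iris), "Sepal.Length", "Sepal.Width")
res
```

A different output column could be requested with, for example, out = "sepal_diff".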
Sample df:
library(tidyverse)
iris <- iris[1:10,]
iris$testlag <- NA
iris[[1,"testlag"]] <- 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species testlag
1 5.1 3.5 1.4 0.2 setosa 5
2 4.9 3.0 1.4 0.2 setosa NA
3 4.7 3.2 1.3 0.2 setosa NA
4 4.6 3.1 1.5 0.2 setosa NA
5 5.0 3.6 1.4 0.2 setosa NA
6 5.4 3.9 1.7 0.4 setosa NA
7 4.6 3.4 1.4 0.3 setosa NA
8 5.0 3.4 1.5 0.2 setosa NA
9 4.4 2.9 1.4 0.2 setosa NA
10 4.9 3.1 1.5 0.1 setosa NA
In the testlag column, I'm interested in using dplyr::lag() to retrieve the previous value and add another column to it, for example Petal.Length. As I have only one initial value, each subsequent calculation depends on the previous result, so it has to work iteratively. I thought something like mutate would work.
I first tried doing something like this:
iris %>% mutate_at("testlag", ~ lag(.) + Petal.Length)
But this removed the first value, and only gave a valid value for the second row and NAs for the rest. Intuitively I know why it's removing the first value, but I thought the nature of mutate would allow it to work for the rest of the values, so I don't know how to fix that.
Of course, using a base R for loop I could do something like:
for (idx in 2:nrow(iris)) {
iris[[idx, "testlag"]] <-
lag(iris$testlag)[idx] + iris[[idx, "Petal.Length"]]
}
But I would prefer to implement this in tidyverse syntax.
Edit: Desired output (from my for loop)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species testlag
1 5.1 3.5 1.4 0.2 setosa 5.0
2 4.9 3.0 1.4 0.2 setosa 6.4
3 4.7 3.2 1.3 0.2 setosa 7.7
4 4.6 3.1 1.5 0.2 setosa 9.2
5 5.0 3.6 1.4 0.2 setosa 10.6
6 5.4 3.9 1.7 0.4 setosa 12.3
7 4.6 3.4 1.4 0.3 setosa 13.7
8 5.0 3.4 1.5 0.2 setosa 15.2
9 4.4 2.9 1.4 0.2 setosa 16.6
10 4.9 3.1 1.5 0.1 setosa 18.1
Does this work for you?
library(tidyverse)
iris <- iris[1:10,]
iris$testlag <- NA
iris[[1,"testlag"]] <- 5
iris %>% mutate(testlag = lag(first(testlag) + cumsum(Petal.Length)))
Result:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species testlag
1 5.1 3.5 1.4 0.2 setosa NA
2 4.9 3.0 1.4 0.2 setosa 6.4
3 4.7 3.2 1.3 0.2 setosa 7.8
4 4.6 3.1 1.5 0.2 setosa 9.1
5 5.0 3.6 1.4 0.2 setosa 10.6
6 5.4 3.9 1.7 0.4 setosa 12.0
7 4.6 3.4 1.4 0.3 setosa 13.7
8 5.0 3.4 1.5 0.2 setosa 15.1
9 4.4 2.9 1.4 0.2 setosa 16.6
10 4.9 3.1 1.5 0.1 setosa 18.0
Since technically there is no N-1 Petal.Length when N = 1, I left the first value of testlag as NA. Do you really need it to be the initial value? If you do, this will work:
iris %>% mutate (testlag = lag(first(testlag) + cumsum(Petal.Length), default=first(testlag)))
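Note that the values above differ slightly from the desired output in the question (7.8 rather than 7.7 in row 3, for instance), because the for loop adds the current row's Petal.Length to the previous testlag. A sketch that appears to reproduce the loop's recurrence, assuming the seed value always sits in row 1; df, by_cumsum and by_accumulate are illustrative names.

```r
library(dplyr)
library(purrr)

df <- datasets::iris[1:10, ]
df$testlag <- NA_real_
df$testlag[1] <- 5

# Closed form: seed + running sum of Petal.Length, skipping row 1's own value
by_cumsum <- df %>%
  mutate(testlag = first(testlag) + cumsum(replace(Petal.Length, 1, 0)))

# Explicit recurrence: start from the seed and keep adding the next Petal.Length
by_accumulate <- df %>%
  mutate(testlag = accumulate(Petal.Length[-1], `+`, .init = first(testlag)))

by_cumsum$testlag
```

The accumulate version is the more general of the two, since the `+` can be swapped for any function of the previous result and the current value.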
The function you're looking for is tidyr::fill
library(tidyverse)
iris <- iris[1:10,]
iris$testlag <- NA
iris[[1,"testlag"]] <- 5
iris %>% fill(testlag, .direction = "down")
# Note the default is 'down', but I included here for completeness
This takes the specified column (testlag in this case), and copies any values in that column to the values below. This also works if you have a value in a subset of the rows: it copies the value down until it reaches a new value, then it picks up with that one.
For example:
library(tidyverse)
iris <- iris[1:10,]
iris$testlag <- NA
iris[[1,"testlag"]] <- 5
iris[[5,"testlag"]] <- 10
Sepal.Length Sepal.Width Petal.Length Petal.Width Species testlag
1 5.1 3.5 1.4 0.2 setosa 5
2 4.9 3.0 1.4 0.2 setosa NA
3 4.7 3.2 1.3 0.2 setosa NA
4 4.6 3.1 1.5 0.2 setosa NA
5 5.0 3.6 1.4 0.2 setosa 10
6 5.4 3.9 1.7 0.4 setosa NA
7 4.6 3.4 1.4 0.3 setosa NA
8 5.0 3.4 1.5 0.2 setosa NA
9 4.4 2.9 1.4 0.2 setosa NA
10 4.9 3.1 1.5 0.1 setosa NA
Applying this function...
iris %>% fill(testlag, .direction = "down")
Gives
Sepal.Length Sepal.Width Petal.Length Petal.Width Species testlag
1 5.1 3.5 1.4 0.2 setosa 5
2 4.9 3.0 1.4 0.2 setosa 5
3 4.7 3.2 1.3 0.2 setosa 5
4 4.6 3.1 1.5 0.2 setosa 5
5 5.0 3.6 1.4 0.2 setosa 10
6 5.4 3.9 1.7 0.4 setosa 10
7 4.6 3.4 1.4 0.3 setosa 10
8 5.0 3.4 1.5 0.2 setosa 10
9 4.4 2.9 1.4 0.2 setosa 10
10 4.9 3.1 1.5 0.1 setosa 10
This question builds from the SO post found here
I am trying to extract a random sample of rows in a data frame using a nesting condition.
Using the following dummy dataset (modified from iris):
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 5.3 2.9 1.5 0.2 setosa
5 5.2 3.7 1.3 0.2 virginica
6 4.7 3.2 1.5 0.2 virginica
7 3.9 3.1 1.4 0.2 virginica
8 4.7 3.2 1.3 0.2 virginica
9 4.0 3.1 1.5 0.2 versicolor
10 5.0 3.6 1.4 0.2 versicolor
11 4.6 3.1 1.5 0.2 versicolor
12 5.0 3.6 1.5 0.2 versicolor
The code below works fine to take a simple sample of 2 rows:
iris[sample(nrow(iris), 2), ]
However, what I would like to do is take a sample of 2 rows for each level of a specific variable. For example, create a random sample of 2 rows for each level of the variable 'Species', like this:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
4 5.3 2.9 1.5 0.2 setosa
6 4.7 3.2 1.5 0.2 virginica
7 3.9 3.1 1.4 0.2 virginica
11 4.6 3.1 1.5 0.2 versicolor
12 5.0 3.6 1.5 0.2 versicolor
Thanks for your help!
Very easy with dplyr:
library(dplyr)
iris %>%
group_by(Species) %>%
sample_n(size = 2)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 4.6 3.4 1.4 0.3 setosa
# 2 5.2 3.5 1.5 0.2 setosa
# 3 6.5 2.8 4.6 1.5 versicolor
# 4 5.7 2.8 4.5 1.3 versicolor
# 5 5.8 2.8 5.1 2.4 virginica
# 6 7.7 2.6 6.9 2.3 virginica
You can group by as many columns as you'd like
CO2 %>% group_by(Type, Treatment) %>% sample_n(size = 2)
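In dplyr 1.0.0 and later, sample_n is superseded by slice_sample, which takes the same kind of grouped input. A short sketch under that version assumption; set.seed is only there to make the draw repeatable and sampled is an illustrative name.

```r
library(dplyr)

set.seed(42)  # any seed; fixes the random draw so it can be repeated

sampled <- datasets::iris %>%
  group_by(Species) %>%
  slice_sample(n = 2) %>%  # 2 random rows within each Species
  ungroup()

sampled
```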
When splitting a dataframe with by, the 'by' variables are printed, but not retained as variables.
data(iris)
dflist <- by(iris[,1:4], iris[,"Species"], data.frame)
head(dflist[[1]])
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.1 3.5 1.4 0.2
2 4.9 3.0 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
5 5.0 3.6 1.4 0.2
Is it possible to retain the variable as a column var as below?
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
Or is there a better way to group the data by certain variables into a list object?
If you want to keep the Species column, then you just have to ask for it. Right now you are explicitly removing it by only selecting columns 1:4.
dflist <- by(iris[,1:5], iris[,"Species"], data.frame)
head(dflist[[1]])
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
or at this point, since you are just splitting the data and not applying a function
dflist <- split(iris, iris[,"Species"])
would work just as well.
split might do what you're looking for:
split(iris, iris$Species)
# $setosa
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# ...
# $versicolor
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 51 7.0 3.2 4.7 1.4 versicolor
# 52 6.4 3.2 4.5 1.5 versicolor
# 53 6.9 3.1 4.9 1.5 versicolor
# 54 5.5 2.3 4.0 1.3 versicolor
# 55 6.5 2.8 4.6 1.5 versicolor
# ...
Is this what you want?
species_list <- split(iris,iris$Species,drop=FALSE)
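split also accepts a list of factors, which covers the "group by certain variables" case with more than one grouping column. A sketch using the built-in CO2 data, with Type and Treatment picked arbitrarily as the grouping variables and groups as an illustrative name:

```r
# One list element per Type/Treatment combination that occurs in the data
groups <- split(datasets::CO2,
                list(datasets::CO2$Type, datasets::CO2$Treatment),
                drop = TRUE)

names(groups)
sapply(groups, nrow)
```

Each element keeps all of the original columns, including the grouping variables themselves.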