I would like to calculate a distance by group. There are three classes in the data frame.
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 versicolor
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 virginica
I have also calculated the row sum of each class
rsums = aggregate(iris$rsum, by=list(Class=iris$Species), FUN=sum)
and the outcome looks like this:
Class Centroid
1 setosa 1521.3
2 versicolor 2143.8
3 virginica 2571.0
So I need to subtract the sum of each group from each row value of the same group, to get the absolute difference, as in the example below.
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1-1521.3 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7-2143.8 3.2 1.3 0.2 versicolor
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4-2571.0 3.9 1.7 0.4 virginica
I think your question is a bit unclear. But does
iris$rsum <- rowSums(iris[,-5])
rsums <- aggregate(iris$rsum, by=list(Class=iris$Species), FUN=sum)
iris[,-(5:6)] <- iris[,-(5:6)] - rsums$x[iris$Species]
head(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species rsum
#1 -1516.2 -1517.8 -1519.9 -1521.1 setosa 30.6
#2 -1516.4 -1518.3 -1519.9 -1521.1 setosa 28.5
#3 -1516.6 -1518.1 -1520.0 -1521.1 setosa 28.2
#4 -1516.7 -1518.2 -1519.8 -1521.1 setosa 28.2
#5 -1516.3 -1517.7 -1519.9 -1521.1 setosa 30.6
#6 -1515.9 -1517.4 -1519.6 -1520.9 setosa 34.2
do what you want?
This relies on the fact that iris$Species is a factor (indexing rsums$x with it uses the factor's integer codes), together with R's recycling rules when subtracting.
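To make the indexing and recycling explicit, here is a minimal sketch on a fresh copy of iris. The names df, per_row and abs_diff are only illustrative, and wrapping the result in abs() is an assumption about what "absolute difference" means in the question. On an unmodified iris the group sums will differ from the numbers printed above, which reflect the asker's own rsum column.

```r
# Work on a copy so the built-in iris data stays untouched
df <- iris
df$rsum <- rowSums(df[, 1:4])

# One sum per species, in the order of levels(df$Species)
rsums <- aggregate(df$rsum, by = list(Class = df$Species), FUN = sum)

# A factor used as an index behaves like its integer codes (1, 2, 3),
# so this expands the three group sums to one value per row
per_row <- rsums$x[df$Species]
length(per_row)  # 150

# ave() computes the same per-row group sums in a single step
stopifnot(isTRUE(all.equal(per_row, ave(df$rsum, df$Species, FUN = sum))))

# The length-150 vector is recycled down each of the four columns
abs_diff <- abs(df[, 1:4] - per_row)
head(abs_diff, 3)
```

Working on a copy also avoids overwriting the measurement columns of iris, which the in-place assignment above does.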
I’d like to dynamically assign which columns to subtract from each other. I’ve read around and it looks like I need to use all_of, and maybe across (How to subtract one column from multiple columns in a dataframe in R using dplyr, How to you use objects in dplyr filter?). I can get it working for one variable in a mutate call (e.g. mutate(y = all_of(x))), but I can’t seem to do even simple calculations using two. Here’s a simplified example of what I want to do:
var1 <- c("Sepal.Length")
var2 <- c("Sepal.Width")
result <- iris %>%
mutate(calculation = all_of(var1) - all_of(var2))
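We may use the .data pronoun to extract a column as a vector by its name. all_of/any_of are selection helpers; they are meant to be used along with across to loop across the columns, not directly in an arithmetic expression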
library(dplyr)
iris %>%
mutate(calculation = .data[[var1]] - .data[[var2]])%>%
head
-output
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
Or we may also use cur_data() (deprecated since dplyr 1.1.0 in favor of pick(), so newer versions may emit a warning)
iris %>%
head %>%
mutate(calculation = cur_data()[[var1]] - cur_data()[[var2]])
-output
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
Or another option is to pass both the variables in across, and then reduce with -
library(purrr)
iris %>%
head %>%
mutate(calculation = reduce(across(all_of(c(var1, var2))), `-`))
-output
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
Or could convert to symbol and evaluate (!!)
iris %>%
head %>%
mutate(calculation = !! rlang::sym(var1) - !! rlang::sym(var2))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
Or if we want to use all_of in across, just subset the column with [[
iris %>%
head %>%
mutate(calculation = across(all_of(var1))[[1]] -
across(all_of(var2))[[1]])
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
The reason we need to subset is that across returns a data frame rather than a vector, and when .names is not supplied its column keeps the original column name. Without [[1]], calculation will be a data.frame with a single column
out <- iris %>%
head %>%
mutate(calculation = across(all_of(var1)) -
across(all_of(var2)))
out
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
str(out)
'data.frame': 6 obs. of 6 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1
$ calculation :'data.frame': 6 obs. of 1 variable:
..$ Sepal.Length: num 1.6 1.9 1.5 1.5 1.4 1.5
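The same distinction can be seen in base R without dplyr. This is only a small sketch (d, as_df and as_vec are illustrative names): single brackets keep the data frame structure, while double brackets return the underlying vector, which is what [[1]] does to the result of across.

```r
var1 <- "Sepal.Length"
var2 <- "Sepal.Width"
d <- head(iris)

# Single-bracket selection keeps the data frame structure,
# so the difference is itself a one-column data frame
as_df <- d[var1] - d[var2]
class(as_df)   # "data.frame"
names(as_df)   # "Sepal.Length" (taken from the left operand)

# Double-bracket extraction returns plain numeric vectors
as_vec <- d[[var1]] - d[[var2]]
class(as_vec)  # "numeric"
```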
We could use get to access the values of a variable whose name is stored in a string (thanks to akrun for the assist):
iris %>%
mutate(calculation = get(var1) - get(var2))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
7 4.6 3.4 1.4 0.3 setosa 1.2
8 5 3.4 1.5 0.2 setosa 1.6
9 4.4 2.9 1.4 0.2 setosa 1.5
10 4.9 3.1 1.5 0.1 setosa 1.8
# ... with 140 more rows
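One caveat with get, as far as I understand it: if the string does not match a column, get keeps searching the enclosing environments and may silently pick up an unrelated object, whereas .data[[...]] only looks inside the data frame. A sketch of a small wrapper built on .data follows; subtract_cols and its out argument are hypothetical names made up for this example, not part of dplyr.

```r
library(dplyr)

# Hypothetical helper: subtract column `b` from column `a`,
# where both column names are passed as strings
subtract_cols <- function(df, a, b, out = "calculation") {
  # Fail early if either name is not a column of df
  stopifnot(all(c(a, b) %in% names(df)))
  df %>% mutate(!!out := .data[[a]] - .data[[b]])
}

res <- subtract_cols(head(iris), "Sepal.Length", "Sepal.Width")
res$calculation
```

For the first six rows of iris this should reproduce the calculation column shown in the answers above.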
This is a simplified version of the actual problem I'm dealing with. In this example, I'll be working with four columns, and the actual problem requires working with about 20-30 columns.
Consider the iris dataset. Suppose that I wanted to, for some reason, append new columns which would be equal to double the .Length and the .Width columns. With the following code, this would change the existing columns:
library(dplyr)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
df_iris <- iris %>% mutate(across(matches("(\\.)(Length|Width)"),
function(x) { x * 2 }))
head(df_iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 10.2 7.0 2.8 0.4 setosa
2 9.8 6.0 2.8 0.4 setosa
3 9.4 6.4 2.6 0.4 setosa
4 9.2 6.2 3.0 0.4 setosa
5 10.0 7.2 2.8 0.4 setosa
6 10.8 7.8 3.4 0.8 setosa
However, instead, I would like to have this doubled calculation create NEW columns, say .Length.2 and .Width.2. One way this could be done is the following:
double <- function(x) {
x * 2
}
df_iris <- iris %>%
mutate(Sepal.Length.2 = double(Sepal.Length),
Sepal.Width.2 = double(Sepal.Width),
Petal.Length.2 = double(Petal.Length),
Petal.Width.2 = double(Petal.Width))
head(df_iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length.2 Sepal.Width.2 Petal.Length.2 Petal.Width.2
1 5.1 3.5 1.4 0.2 setosa 10.2 7.0 2.8 0.4
2 4.9 3.0 1.4 0.2 setosa 9.8 6.0 2.8 0.4
3 4.7 3.2 1.3 0.2 setosa 9.4 6.4 2.6 0.4
4 4.6 3.1 1.5 0.2 setosa 9.2 6.2 3.0 0.4
5 5.0 3.6 1.4 0.2 setosa 10.0 7.2 2.8 0.4
6 5.4 3.9 1.7 0.4 setosa 10.8 7.8 3.4 0.8
Is there a way to do this in dplyr without:
relying on superseded/deprecated functions?
having to manually specify each column name?
We can use across with the .names argument (tested with dplyr 1.0.6), reusing the double function defined in the question
library(dplyr)
df_iris <- iris %>%
mutate(across(where(is.numeric), double, .names = '{.col}.2'))
-output
head(df_iris, 3)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length.2 Sepal.Width.2 Petal.Length.2 Petal.Width.2
1 5.1 3.5 1.4 0.2 setosa 10.2 7.0 2.8 0.4
2 4.9 3.0 1.4 0.2 setosa 9.8 6.0 2.8 0.4
3 4.7 3.2 1.3 0.2 setosa 9.4 6.4 2.6 0.4
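A variant that does not depend on a user-defined double (which masks base R's double()) is to pass the calculation as a lambda. This is a sketch assuming dplyr 1.0.0 or later, where across and where are available; the ".2" suffix is simply the one used in the question.

```r
library(dplyr)

df_iris <- iris %>%
  # {.col} is replaced by each selected column's name,
  # so the original columns are kept and new ones are appended
  mutate(across(where(is.numeric), ~ .x * 2, .names = "{.col}.2"))

names(df_iris)
```

With 20-30 columns, only the selection inside across needs to change, for example to matches("\\.(Length|Width)$").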
I have a data frame with 81 objects and 12 variables, including an ID for each object.
Further, I have a sorted(!) list of IDs.
Now, I want to sort my data frame according to this specific list.
Can anyone make a simple example for that case?
I am a newbie, trying to learn.
Thanks in advance!
Quick example of my case:
ID City NR1 NR2
Dataframe1 = "11000", Berlin, (123,2), (532,1)
"02401", Hamburg, (435,2), (352,1)
"83329", München, (124,3), (125,2)
ID = list("02401", "83329", "11000")
Now, I want Dataframe1 to be sorted according to the order of the IDs in the list.
You can order your data frame using arrange() from dplyr (load it first with library(dplyr)).
An example:
The iris dataset, as is:
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
creating an external vector:
index<-sample(1:150)
Then you can sort your dataframe with that external vector:
head(arrange(iris, index))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 6.4 2.7 5.3 1.9 virginica
2 5.5 3.5 1.3 0.2 setosa
3 6.3 3.3 6.0 2.5 virginica
4 6.3 3.3 4.7 1.6 versicolor
5 4.9 2.5 4.5 1.7 virginica
6 5.7 2.8 4.5 1.3 versicolor
To arrange by a specific external vector that matches one of the variables, you can use match()
iris2<-head(iris)%>%mutate(ID=sample(1:150, 6))
> iris2
Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID
1 5.1 3.5 1.4 0.2 setosa 29
2 4.9 3.0 1.4 0.2 setosa 61
3 4.7 3.2 1.3 0.2 setosa 69
4 4.6 3.1 1.5 0.2 setosa 89
5 5.0 3.6 1.4 0.2 setosa 59
6 5.4 3.9 1.7 0.4 setosa 84
external_vector<-c(69,59,84,29,61,89)
arrange with match():
iris2[match(external_vector, iris2$ID),]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID
3 4.7 3.2 1.3 0.2 setosa 69
5 5.0 3.6 1.4 0.2 setosa 59
6 5.4 3.9 1.7 0.4 setosa 84
1 5.1 3.5 1.4 0.2 setosa 29
2 4.9 3.0 1.4 0.2 setosa 61
4 4.6 3.1 1.5 0.2 setosa 89
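Applied to the example from the question, the same idea might look like the sketch below. The NR1 and NR2 columns are left out because their format is not clear from the question, and the IDs are kept as character strings so that the leading zero in "02401" survives.

```r
Dataframe1 <- data.frame(
  ID   = c("11000", "02401", "83329"),
  City = c("Berlin", "Hamburg", "München"),
  stringsAsFactors = FALSE
)
ID <- list("02401", "83329", "11000")

# unlist() turns the list into a character vector; match() then gives,
# for each wanted ID, its row position in the data frame
sorted <- Dataframe1[match(unlist(ID), Dataframe1$ID), ]
sorted$ID  # "02401" "83329" "11000"
```

If an ID in the list is missing from the data frame, match() returns NA for it and the result contains a row of NAs, so it may be worth checking anyNA() on the match result first.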
This question builds from the SO post found here
I am trying to extract a random sample of rows in a data frame using a nesting condition.
Using the following dummy dataset (modified from iris):
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 5.3 2.9 1.5 0.2 setosa
5 5.2 3.7 1.3 0.2 virginica
6 4.7 3.2 1.5 0.2 virginica
7 3.9 3.1 1.4 0.2 virginica
8 4.7 3.2 1.3 0.2 virginica
9 4.0 3.1 1.5 0.2 versicolor
10 5.0 3.6 1.4 0.2 versicolor
11 4.6 3.1 1.5 0.2 versicolor
12 5.0 3.6 1.5 0.2 versicolor
The code below works fine to take a simple sample of 2 rows:
iris[sample(nrow(iris), 2), ]
However, what I would like to do is to take a sample of 2 rows for each level of a specific variable. For example, create a random sample of 2 rows for each level of the variable 'Species', like this:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
4 5.3 2.9 1.5 0.2 setosa
6 4.7 3.2 1.5 0.2 virginica
7 3.9 3.1 1.4 0.2 virginica
11 4.6 3.1 1.5 0.2 versicolor
12 5.0 3.6 1.5 0.2 versicolor
Thanks for your help!
Very easy with dplyr:
library(dplyr)
iris %>%
group_by(Species) %>%
sample_n(size = 2)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 4.6 3.4 1.4 0.3 setosa
# 2 5.2 3.5 1.5 0.2 setosa
# 3 6.5 2.8 4.6 1.5 versicolor
# 4 5.7 2.8 4.5 1.3 versicolor
# 5 5.8 2.8 5.1 2.4 virginica
# 6 7.7 2.6 6.9 2.3 virginica
You can group by as many columns as you'd like
CO2 %>% group_by(Type, Treatment) %>% sample_n(size = 2)
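In dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(), which works the same way on grouped data. A minimal sketch, where set.seed() is only there to make the draw reproducible and picked is an illustrative name:

```r
library(dplyr)

set.seed(1)  # only to make the random draw reproducible
picked <- iris %>%
  group_by(Species) %>%
  slice_sample(n = 2) %>%   # two random rows within each group
  ungroup()

# Check that every species contributes exactly two rows
count(picked, Species)
```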
When splitting a dataframe with by, the 'by' variables are printed, but not retained as variables.
data(iris)
dflist <- by(iris[,1:4], iris[,"Species"], data.frame)
head(dflist[[1]])
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.1 3.5 1.4 0.2
2 4.9 3.0 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
5 5.0 3.6 1.4 0.2
Is it possible to retain the variable as a column var as below?
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
Or is there a better way to group the data by certain variables into a list object?
If you want to keep the species column, then you just have to ask for it. Right now you are explicitly removing it by only selecting columns 1:4.
dflist <- by(iris[,1:5], iris[,"Species"], data.frame)
head(dflist[[1]])
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
or at this point, since you are just splitting the data and not applying a function
dflist <- split(iris, iris[,"Species"])
would work just as well.
split might do what you're looking for:
split(iris, iris$Species)
# $setosa
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# ...
# $versicolor
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 51 7.0 3.2 4.7 1.4 versicolor
# 52 6.4 3.2 4.5 1.5 versicolor
# 53 6.9 3.1 4.9 1.5 versicolor
# 54 5.5 2.3 4.0 1.3 versicolor
# 55 6.5 2.8 4.6 1.5 versicolor
# ...
Is this what you want?
species_list <- split(iris,iris$Species,drop=FALSE)
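Once the data is split, each list element is an ordinary data frame that still contains Species, so per-group work can be done with lapply or vapply. A short sketch (group_means is an illustrative name; drop = FALSE is already the default of split, so it is omitted here):

```r
species_list <- split(iris, iris$Species)

# One data frame per factor level, each keeping the Species column
names(species_list)
vapply(species_list, nrow, integer(1))

# Apply a function to every group, e.g. the means of the numeric columns
group_means <- lapply(species_list, function(d) colMeans(d[, 1:4]))
group_means$setosa
```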