Sample df:
library(tidyverse)
iris <- iris[1:10,]
iris$testlag <- NA
iris[[1,"testlag"]] <- 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species testlag
1 5.1 3.5 1.4 0.2 setosa 5
2 4.9 3.0 1.4 0.2 setosa NA
3 4.7 3.2 1.3 0.2 setosa NA
4 4.6 3.1 1.5 0.2 setosa NA
5 5.0 3.6 1.4 0.2 setosa NA
6 5.4 3.9 1.7 0.4 setosa NA
7 4.6 3.4 1.4 0.3 setosa NA
8 5.0 3.4 1.5 0.2 setosa NA
9 4.4 2.9 1.4 0.2 setosa NA
10 4.9 3.1 1.5 0.1 setosa NA
In the testlag column, I'm interested in using dplyr::lag() to retrieve the previous value and add another column to it, for example Petal.Length. As I have only one initial value, each subsequent calculation depends on the previous result, so it has to work iteratively. I thought something like mutate would work.
I first tried doing something like this:
iris %>% mutate_at("testlag", ~ lag(.) + Petal.Length)
But this removed the first value, and only gave a valid value for the second row and NAs for the rest. Intuitively I know why it's removing the first value, but I thought the nature of mutate would allow it to work for the rest of the values, so I don't know how to fix that.
Of course, using base R I could do something like:
for (idx in 2:nrow(iris)) {
iris[[idx, "testlag"]] <-
lag(iris$testlag)[idx] + iris[[idx, "Petal.Length"]]
}
But I would prefer to implement this in tidyverse syntax.
Edit: Desired output (from my for loop)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species testlag
1 5.1 3.5 1.4 0.2 setosa 5.0
2 4.9 3.0 1.4 0.2 setosa 6.4
3 4.7 3.2 1.3 0.2 setosa 7.7
4 4.6 3.1 1.5 0.2 setosa 9.2
5 5.0 3.6 1.4 0.2 setosa 10.6
6 5.4 3.9 1.7 0.4 setosa 12.3
7 4.6 3.4 1.4 0.3 setosa 13.7
8 5.0 3.4 1.5 0.2 setosa 15.2
9 4.4 2.9 1.4 0.2 setosa 16.6
10 4.9 3.1 1.5 0.1 setosa 18.1
Does this work for you?
library(tidyverse)
iris <- iris[1:10,]
iris$testlag <- NA
iris[[1,"testlag"]] <- 5
iris %>% mutate(testlag = lag(first(testlag) + cumsum(Petal.Length)))
Result:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species testlag
1 5.1 3.5 1.4 0.2 setosa NA
2 4.9 3.0 1.4 0.2 setosa 6.4
3 4.7 3.2 1.3 0.2 setosa 7.8
4 4.6 3.1 1.5 0.2 setosa 9.1
5 5.0 3.6 1.4 0.2 setosa 10.6
6 5.4 3.9 1.7 0.4 setosa 12.0
7 4.6 3.4 1.4 0.3 setosa 13.7
8 5.0 3.4 1.5 0.2 setosa 15.1
9 4.4 2.9 1.4 0.2 setosa 16.6
10 4.9 3.1 1.5 0.1 setosa 18.0
Since technically there is no N-1 Petal.Length when N = 1, I left the first value of testlag as NA. Do you really need it to be the initial value? If so, this will work:
iris %>% mutate(testlag = lag(first(testlag) + cumsum(Petal.Length), default = first(testlag)))
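Note that the lag() version above adds the previous row's Petal.Length, so its values differ from the for loop output from row 3 onward (7.8 rather than 7.7, and so on). A minimal sketch that reproduces the for loop exactly, assuming the recurrence is testlag[i] = testlag[i - 1] + Petal.Length[i] with the seed in row 1. The names df, closed and iterative are only for illustration:

```r
library(dplyr)
library(purrr)

df <- iris[1:10, ]
df$testlag <- NA_real_
df[[1, "testlag"]] <- 5

# Closed form: seed plus the running sum of Petal.Length, excluding row 1
closed <- df %>%
  mutate(testlag = first(testlag) + cumsum(Petal.Length) - first(Petal.Length))

# Explicit recurrence: start from the seed and add Petal.Length row by row
iterative <- df %>%
  mutate(testlag = accumulate(Petal.Length[-1], `+`, .init = first(testlag)))
```

The cumsum() form is fully vectorised. accumulate() is the more general tool when the update rule is not a plain sum.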
The function you're looking for is tidyr::fill
library(tidyverse)
iris <- iris[1:10,]
iris$testlag <- NA
iris[[1,"testlag"]] <- 5
iris %>% fill(testlag, .direction = "down")
# Note the default is 'down', but I included here for completeness
This takes the specified column (testlag in this case), and copies any values in that column to the values below. This also works if you have a value in a subset of the rows: it copies the value down until it reaches a new value, then it picks up with that one.
For example:
library(tidyverse)
iris <- iris[1:10,]
iris$testlag <- NA
iris[[1,"testlag"]] <- 5
iris[[5,"testlag"]] <- 10
Sepal.Length Sepal.Width Petal.Length Petal.Width Species testlag
1 5.1 3.5 1.4 0.2 setosa 5
2 4.9 3.0 1.4 0.2 setosa NA
3 4.7 3.2 1.3 0.2 setosa NA
4 4.6 3.1 1.5 0.2 setosa NA
5 5.0 3.6 1.4 0.2 setosa 10
6 5.4 3.9 1.7 0.4 setosa NA
7 4.6 3.4 1.4 0.3 setosa NA
8 5.0 3.4 1.5 0.2 setosa NA
9 4.4 2.9 1.4 0.2 setosa NA
10 4.9 3.1 1.5 0.1 setosa NA
Applying this function...
iris %>% fill(testlag, .direction = "down")
Gives
Sepal.Length Sepal.Width Petal.Length Petal.Width Species testlag
1 5.1 3.5 1.4 0.2 setosa 5
2 4.9 3.0 1.4 0.2 setosa 5
3 4.7 3.2 1.3 0.2 setosa 5
4 4.6 3.1 1.5 0.2 setosa 5
5 5.0 3.6 1.4 0.2 setosa 10
6 5.4 3.9 1.7 0.4 setosa 10
7 4.6 3.4 1.4 0.3 setosa 10
8 5.0 3.4 1.5 0.2 setosa 10
9 4.4 2.9 1.4 0.2 setosa 10
10 4.9 3.1 1.5 0.1 setosa 10
Related
I have a data frame and I want to create a subset, <Frame>, of just the species variable and display the first five records. How can I subset it in R?
There are 10 rows and 7 columns. One column is species.
netID- fishID - species- tl - wtag - scale
Use select() from dplyr, and pass 5 to head(), which otherwise returns six rows:
library(dplyr)
head(
select(dataframe, species),
5
)
Assuming your dataframe is called df you can subset with dplyr
library(dplyr)
df <- iris[1:10,]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
newdf <- df %>% select(Species) %>% slice(1:5)
Here select() keeps only the Species column of the data frame, and slice() then keeps the range of rows you need. The output of newdf is
Species
1 setosa
2 setosa
3 setosa
4 setosa
5 setosa
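For comparison, a minimal base R sketch of the same subset, assuming the data frame is called df as above. Indexing with a single string in single brackets keeps the result a data frame rather than dropping it to a vector:

```r
df <- iris[1:10, ]

# df["Species"] returns a one-column data frame; head(..., 5) keeps five rows
newdf <- head(df["Species"], 5)
```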
I’d like to dynamically assign which columns to subtract from each other. I’ve read around and looks like I need to use all_of, and maybe across (How to subtract one column from multiple columns in a dataframe in R using dplyr, How to you use objects in dplyr filter?). I can get it working for one variable in a mutate phrase (e.g. mutate(y = all_of(x))), but I can’t seem to do even simple calculations using two. Here’s a simplified example of what I want to do:
var1 <- c("Sepal.Length")
var2 <- c("Sepal.Width")
result <- iris %>%
mutate(calculation = all_of(var1) - all_of(var2))
We may use the .data pronoun to extract the column as a vector. all_of/any_of are selection helpers: they are meant to be used along with across (or select) to loop across the columns, not on their own inside mutate
library(dplyr)
iris %>%
mutate(calculation = .data[[var1]] - .data[[var2]]) %>%
head
-output
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
Or we may also use cur_data() (deprecated as of dplyr 1.1.0 in favor of pick())
iris %>%
head %>%
mutate(calculation = cur_data()[[var1]] - cur_data()[[var2]])
-output
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
Or another option is to pass both the variables in across, and then reduce with -
library(purrr)
iris %>%
head %>%
mutate(calculation = reduce(across(all_of(c(var1, var2))), `-`))
-output
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
Or could convert to symbol and evaluate (!!)
iris %>%
head %>%
mutate(calculation = !! rlang::sym(var1) - !! rlang::sym(var2))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
Or if we want to use all_of in across, just subset the column with [[
iris %>%
head %>%
mutate(calculation = across(all_of(var1))[[1]] -
across(all_of(var2))[[1]])
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
We need to subset with [[ because across returns a data frame (a tibble) rather than a vector, and when .names is not supplied its column keeps the original name. Without [[1]], calculation would itself be a data.frame with a single column
out <- iris %>%
head %>%
mutate(calculation = across(all_of(var1)) -
across(all_of(var2)))
out
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
str(out)
'data.frame': 6 obs. of 6 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1
$ calculation :'data.frame': 6 obs. of 1 variable:
..$ Sepal.Length: num 1.6 1.9 1.5 1.5 1.4 1.5
We could use get to access the variable values where the name of variable is stored in a string (thanks to akrun for assist):
iris %>%
mutate(calculation = get(var1) - get(var2))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
7 4.6 3.4 1.4 0.3 setosa 1.2
8 5 3.4 1.5 0.2 setosa 1.6
9 4.4 2.9 1.4 0.2 setosa 1.5
10 4.9 3.1 1.5 0.1 setosa 1.8
# ... with 140 more rows
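If the column names arrive as strings in several places, the .data approach can be wrapped in a small helper. This is only a sketch: subtract_cols and its arguments are hypothetical names, not part of dplyr. Unlike get(), .data[[...]] only looks inside the data frame, so a same-named object in the calling environment cannot be picked up by accident:

```r
library(dplyr)

# Hypothetical helper: subtract one column from another, both given as strings
subtract_cols <- function(data, minuend, subtrahend, out = "calculation") {
  # Fail early if either column name is missing from the data
  stopifnot(all(c(minuend, subtrahend) %in% names(data)))
  data %>%
    mutate("{out}" := .data[[minuend]] - .data[[subtrahend]])
}

result <- subtract_cols(head(iris), "Sepal.Length", "Sepal.Width")
```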
This is a simplified version of the actual problem I'm dealing with. In this example, I'll be working with four columns, and the actual problem requires working with about 20-30 columns.
Consider the iris dataset. Suppose that I wanted to, for some reason, append new columns which would be equal to double the .Length and the .Width columns. With the following code, this would change the existing columns:
library(dplyr)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
df_iris <- iris %>% mutate(across(matches("(\\.)(Length|Width)"),
function(x) { x * 2 }))
head(df_iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 10.2 7.0 2.8 0.4 setosa
2 9.8 6.0 2.8 0.4 setosa
3 9.4 6.4 2.6 0.4 setosa
4 9.2 6.2 3.0 0.4 setosa
5 10.0 7.2 2.8 0.4 setosa
6 10.8 7.8 3.4 0.8 setosa
However, instead, I would like to have this doubled calculation create NEW columns, say .Length.2 and .Width.2. One way this could be done is the following:
double <- function(x) {
x * 2
}
df_iris <- iris %>%
mutate(Sepal.Length.2 = double(Sepal.Length),
Sepal.Width.2 = double(Sepal.Width),
Petal.Length.2 = double(Petal.Length),
Petal.Width.2 = double(Petal.Width))
head(df_iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length.2 Sepal.Width.2 Petal.Length.2 Petal.Width.2
1 5.1 3.5 1.4 0.2 setosa 10.2 7.0 2.8 0.4
2 4.9 3.0 1.4 0.2 setosa 9.8 6.0 2.8 0.4
3 4.7 3.2 1.3 0.2 setosa 9.4 6.4 2.6 0.4
4 4.6 3.1 1.5 0.2 setosa 9.2 6.2 3.0 0.4
5 5.0 3.6 1.4 0.2 setosa 10.0 7.2 2.8 0.4
6 5.4 3.9 1.7 0.4 setosa 10.8 7.8 3.4 0.8
Is there a way to do this in dplyr without:
relying on superseded/deprecated functions?
having to manually specify each column name?
We can use across with the .names argument (tested with dplyr 1.0.6). Here double is the function defined in the question, not base::double
library(dplyr)
df_iris <- iris %>%
mutate(across(where(is.numeric), double, .names = '{.col}.2'))
-output
head(df_iris, 3)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length.2 Sepal.Width.2 Petal.Length.2 Petal.Width.2
1 5.1 3.5 1.4 0.2 setosa 10.2 7.0 2.8 0.4
2 4.9 3.0 1.4 0.2 setosa 9.8 6.0 2.8 0.4
3 4.7 3.2 1.3 0.2 setosa 9.4 6.4 2.6 0.4
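A self-contained variant, as a sketch: it uses a lambda instead of the question's double() helper, and selects columns by the same name pattern as the question rather than by type. The $ anchor in the pattern is an addition so that only names ending in .Length or .Width match:

```r
library(dplyr)

df_iris <- iris %>%
  mutate(across(matches("\\.(Length|Width)$"),  # columns ending in .Length/.Width
                ~ .x * 2,                        # doubling, written inline
                .names = "{.col}.2"))            # new columns; originals kept
```

Because .names is supplied, the original four columns are left untouched and four new columns are appended.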
I have a data frame with 81 objects and 12 variables, including an ID for each object.
Further, I have a sorted(!) list of IDs.
Now, I want to sort my data frame according to this specific list.
Can anyone give a simple example for that case?
I am a newbie, trying to learn.
Thanks in advance!
Quick example of my case:
ID City NR1 NR2
Dataframe1 = "11000", Berlin, (123,2), (532,1)
"02401", Hamburg, (435,2), (352,1)
"83329", München, (124,3), (125,2)
ID = list("02401", "83329", "11000")
Now, I want Dataframe1 to be sorted in the order of the IDs in the list.
You can arrange your dataframe using arrange().
An example:
The iris dataset, as is:
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
creating an external vector:
index<-sample(1:150)
Then you can sort your dataframe with that external vector:
head(arrange(iris, index))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 6.4 2.7 5.3 1.9 virginica
2 5.5 3.5 1.3 0.2 setosa
3 6.3 3.3 6.0 2.5 virginica
4 6.3 3.3 4.7 1.6 versicolor
5 4.9 2.5 4.5 1.7 virginica
6 5.7 2.8 4.5 1.3 versicolor
To arrange by a specific external vector that matches one of the variables, you can use match()
iris2<-head(iris)%>%mutate(ID=sample(1:150, 6))
> iris2
Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID
1 5.1 3.5 1.4 0.2 setosa 29
2 4.9 3.0 1.4 0.2 setosa 61
3 4.7 3.2 1.3 0.2 setosa 69
4 4.6 3.1 1.5 0.2 setosa 89
5 5.0 3.6 1.4 0.2 setosa 59
6 5.4 3.9 1.7 0.4 setosa 84
external_vector<-c(69,59,84,29,61,89)
arrange with match():
iris2[match(external_vector, iris2$ID),]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID
3 4.7 3.2 1.3 0.2 setosa 69
5 5.0 3.6 1.4 0.2 setosa 59
6 5.4 3.9 1.7 0.4 setosa 84
1 5.1 3.5 1.4 0.2 setosa 29
2 4.9 3.0 1.4 0.2 setosa 61
4 4.6 3.1 1.5 0.2 setosa 89
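The same idea also works inside arrange(). A minimal sketch on data shaped like the question's example, where cities and id_order are illustrative names and the NR columns are left out. The IDs are kept as character so the leading zero in "02401" survives:

```r
library(dplyr)

cities <- data.frame(
  ID   = c("11000", "02401", "83329"),
  City = c("Berlin", "Hamburg", "München"),
  stringsAsFactors = FALSE
)

# The question stores the order in a list; unlist() turns it into a vector
id_order <- unlist(list("02401", "83329", "11000"))

# match() gives each row its position in id_order; arrange() sorts by it
sorted <- cities %>% arrange(match(ID, id_order))
```

One difference from the bracket-indexing form: arrange(match(...)) keeps rows whose ID is absent from the vector (they sort last), while df[match(vector, df$ID), ] returns only the listed IDs, with an NA row for any ID not found in the data.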
I'm trying to filter a data.frame with the filter() function from the dplyr package. The main problem here is that I want to use a vector for the conditions.
For example
library(dplyr)
conditions <- c("Sepal.Width<3.2","Species==setosa")
DATA <- iris %>%
filter(conditions) # This doesn't work, of course.
Is there any function that would take
conditions <- c("Sepal.Width<3.2","Species==setosa")
as an input and give me
Sepal.Width<3.2 & Species==setosa
as an output? I thought about using eval(parse...) with sapply and maybe paste0() to add the &, but I can't make it work.
Any help would be appreciated.
There are multiple issues. First, you need to quote inside quotation for the second condition:
conditions <- c("Sepal.Width < 3.2", "Species == 'setosa'")
Then, you need to specify how the two conditions are combined. Here, I assumed an &. Join them into a single string with paste(..., collapse = "&"). Note that it must be collapse, not sep: with sep the two strings stay separate and only the last condition is applied. Then you can use eval(parse(...)):
iris %>%
filter(eval(parse(text = paste(conditions, collapse = "&"))))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.9 3.0 1.4 0.2 setosa
2 4.6 3.1 1.5 0.2 setosa
3 4.4 2.9 1.4 0.2 setosa
4 4.9 3.1 1.5 0.1 setosa
5 4.8 3.0 1.4 0.1 setosa
6 4.3 3.0 1.1 0.1 setosa
7 5.0 3.0 1.6 0.2 setosa
8 4.8 3.1 1.6 0.2 setosa
9 4.9 3.1 1.5 0.2 setosa
10 4.4 3.0 1.3 0.2 setosa
11 4.5 2.3 1.3 0.3 setosa
12 4.8 3.0 1.4 0.3 setosa
On the other hand, I think it is always important to quote Martin Mächler to warn about the potential problems associated with this approach:
The (possibly) only connection is via parse(text = ....) and all good
R programmers should know that this is rarely an efficient or safe
means to construct expressions (or calls). Rather learn more about
substitute(), quote(), and possibly the power of using
do.call(substitute, ......).
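One way to follow that advice, when the conditions do not have to arrive as strings, is to store them as unevaluated expressions from the start. A minimal sketch using rlang::exprs() and the !!! splice operator. It assumes you control where the conditions are defined:

```r
library(dplyr)

# Conditions captured as expressions, so no string parsing is involved
conditions <- rlang::exprs(Sepal.Width < 3.2, Species == "setosa")

# !!! splices each expression in as a separate filter() argument,
# and filter() combines its arguments with a logical AND
filtered <- iris %>% filter(!!!conditions)
```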
Here is a way:
conditions <- c("Sepal.Width<3.2","Species=='setosa'")
# note the small change here: ↑ ↑
DATA <- iris %>%
filter(eval(parse(text = paste(conditions, collapse = "&"))))
> DATA
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.9 3.0 1.4 0.2 setosa
2 4.6 3.1 1.5 0.2 setosa
3 4.4 2.9 1.4 0.2 setosa
4 4.9 3.1 1.5 0.1 setosa
5 4.8 3.0 1.4 0.1 setosa
6 4.3 3.0 1.1 0.1 setosa
7 5.0 3.0 1.6 0.2 setosa
8 4.8 3.1 1.6 0.2 setosa
9 4.9 3.1 1.5 0.2 setosa
10 4.4 3.0 1.3 0.2 setosa
11 4.5 2.3 1.3 0.3 setosa
12 4.8 3.0 1.4 0.3 setosa
A tidyeval way would be to use rlang::parse_exprs().
library(dplyr)
conditions <- c("Sepal.Width < 3.2", "Species == 'setosa'")
iris %>%
filter( !!! rlang::parse_exprs(conditions))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.9 3.0 1.4 0.2 setosa
2 4.6 3.1 1.5 0.2 setosa
3 4.4 2.9 1.4 0.2 setosa
4 4.9 3.1 1.5 0.1 setosa
5 4.8 3.0 1.4 0.1 setosa
6 4.3 3.0 1.1 0.1 setosa
7 5.0 3.0 1.6 0.2 setosa
8 4.8 3.1 1.6 0.2 setosa
9 4.9 3.1 1.5 0.2 setosa
10 4.4 3.0 1.3 0.2 setosa
11 4.5 2.3 1.3 0.3 setosa
12 4.8 3.0 1.4 0.3 setosa
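Splicing with !!! always combines the conditions with AND, because that is how filter() treats multiple arguments. If they should be combined with OR instead, the parsed expressions can be folded into a single call first. This is a sketch, with any_cond and either as illustrative names:

```r
library(dplyr)

conditions <- c("Sepal.Width < 3.2", "Species == 'setosa'")
parsed <- rlang::parse_exprs(conditions)

# Fold the list of expressions into one expression: cond1 | cond2 | ...
any_cond <- Reduce(function(a, b) rlang::call2("|", a, b), parsed)

either <- iris %>% filter(!!any_cond)
```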