I am trying to create a table that includes the value of y for when x is equal to or less than a certain value, by group. Below is my code using the iris data set.
For "<=2.5", I expect to get 4.5, 5.0, or 5.8 for the virginica group, since these are the values of Petal.Length associated with a Sepal.Width of 2.5 for virginica. But instead, I get 6.0. Any ideas of where I went wrong? (My actual data set does not have duplicates of the variable analogous to Sepal.Width for the same group, so choosing among those is not an issue for me.)
data(iris)
my.table <- iris %>%
group_by(Species) %>%
summarise("<=2.5" = Petal.Length[which.max(Sepal.Width[Sepal.Width<=2.5])],
"<=3" = Petal.Length[which.max(Sepal.Width[Sepal.Width<=3])],
"<=3.5" = Petal.Length[which.max(Sepal.Width[Sepal.Width<=3.5])],
"<=4" = Petal.Length[which.max(Sepal.Width[Sepal.Width<=4])])
This is related to the question Create a table with values from ecdf graph
The problem is that you are first subsetting the Sepal.Width. Consequently, the index returned by which.max applies to that sub-vector, and no longer corresponds to the indices of the whole Petal.Length vector.
To fix this, you also need to subset Petal.Length correspondingly, e.g.
…
`<=2.5` = Petal.Length[Sepal.Width <= 2.5][which.max(Sepal.Width[Sepal.Width <= 2.5])],
…
… of course this gets rather verbose and repetitive. It might be better to perform the subsetting in a separate step. However, this means creating new columns for every threshold value.
Incidentally, this is unrelated to dplyr.
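The index mismatch is easy to reproduce on a small vector. Below is a minimal sketch with made-up values; w and y are hypothetical stand-ins for Sepal.Width and Petal.Length, not data from iris.

```r
# Made-up values standing in for Sepal.Width (w) and Petal.Length (y)
w <- c(3.3, 2.5, 2.8, 2.2)
y <- c(6.0, 4.5, 5.1, 5.0)

# which.max() indexes into the subset c(2.5, 2.2), so it returns 1 ...
idx <- which.max(w[w <= 2.5])
# ... but position 1 of the full y vector is paired with w = 3.3
wrong <- y[idx]

# Subset both vectors with the same logical mask before indexing
keep <- w <= 2.5
right <- y[keep][which.max(w[keep])]
```

Here wrong is 6.0 (the y value paired with w = 3.3), while right is 4.5 (the y value paired with w = 2.5).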
To make it more scalable, use a double loop over the groups and the thresholds:
myCuts <- c(2.5, 3, 3.5, 4)
res <- sapply(split(iris, iris$Species), function(i)
sapply(myCuts, function(j){
x <- i[ i$Sepal.Width <= j, ]
x$Petal.Length[ which.max(x$Sepal.Width) ]
}))
rownames(res) <- paste0("<=", myCuts)
res
# setosa versicolor virginica
# <=2.5 1.3 3.9 4.5
# <=3 1.4 4.2 5.9
# <=3.5 1.4 4.5 5.6
# <=4 1.2 4.5 6.7
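Here's another way to get similar data. Create a grouping variable from the Sepal.Width values, then within each group select a single extreme row. Note that top_n(1, -Sepal.Width) as written keeps the rows with the smallest Sepal.Width in each bin (ties included), as the output shows; use top_n(1, Sepal.Width) to keep the largest, and select Petal.Length instead of Petal.Width to match the question. The result is in a different "shape", but you can always use pivot_wider if you want all the values as columns instead of rows.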
iris %>%
group_by(Species,
Sepal.Width_grp = case_when(Sepal.Width <= 2.5 ~ '<=2.5',
Sepal.Width <= 3 ~ '<=3',
Sepal.Width <= 3.5 ~ '<=3.5',
Sepal.Width <= 4 ~ '<=4',
TRUE ~ '> 4')) %>%
top_n(1, -Sepal.Width) %>%
select(Species, Sepal.Width_grp, Top.Sepal.Width = Sepal.Width, Petal.Width)
# # A tibble: 25 x 4
# # Groups: Species, Sepal.Width_grp [12]
# Species Sepal.Width_grp Top.Sepal.Width Petal.Width
# <fct> <chr> <dbl> <dbl>
# 1 setosa <=3.5 3.1 0.2
# 2 setosa <=4 3.6 0.2
# 3 setosa <=3 2.9 0.2
# 4 setosa <=3.5 3.1 0.1
# 5 setosa <=4 3.6 0.2
# 6 setosa <=3.5 3.1 0.2
# 7 setosa > 4 4.1 0.1
# 8 setosa <=3.5 3.1 0.2
# 9 setosa <=4 3.6 0.1
# 10 setosa <=2.5 2.3 0.3
# # ... with 15 more rows
Edit: A little simpler if you use cut
iris %>%
group_by(Species,
Sepal.Width_grp = cut(Sepal.Width, c(0, 2.5, 3, 3.5, 4, Inf))) %>%
top_n(1, -Sepal.Width) %>%
select(Species, Sepal.Width_grp, Top.Sepal.Width = Sepal.Width, Petal.Width)
# # A tibble: 25 x 4
# # Groups: Species, Sepal.Width_grp [12]
# Species Sepal.Width_grp Top.Sepal.Width Petal.Width
# <fct> <fct> <dbl> <dbl>
# 1 setosa (3,3.5] 3.1 0.2
# 2 setosa (3.5,4] 3.6 0.2
# 3 setosa (2.5,3] 2.9 0.2
# 4 setosa (3,3.5] 3.1 0.1
# 5 setosa (3.5,4] 3.6 0.2
# 6 setosa (3,3.5] 3.1 0.2
# 7 setosa (4,Inf] 4.1 0.1
# 8 setosa (3,3.5] 3.1 0.2
# 9 setosa (3.5,4] 3.6 0.1
# 10 setosa (0,2.5] 2.3 0.3
# # ... with 15 more rows
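As a hedged sketch of the "largest value per bin, then pivot" idea, the following uses slice_max (which supersedes top_n since dplyr 1.0.0) and tidyr's pivot_wider. It assumes both packages are installed. Note that the bins are disjoint, so a species with no observation in a bin gets NA there, unlike the cumulative "<=" thresholds in the question.

```r
library(dplyr)
library(tidyr)

wide <- iris %>%
  # Bin Sepal.Width into the same intervals as above
  mutate(Sepal.Width_grp = cut(Sepal.Width, c(0, 2.5, 3, 3.5, 4, Inf))) %>%
  group_by(Species, Sepal.Width_grp) %>%
  # Keep one row per group: the one with the largest Sepal.Width
  slice_max(Sepal.Width, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(Species, Sepal.Width_grp, Petal.Length) %>%
  # One column per species, one row per bin
  pivot_wider(names_from = Species, values_from = Petal.Length)

wide
```

With with_ties = FALSE only one row is kept when several rows share the maximum, which matches the situation described in the question (no duplicates within a group).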
Related
Let's suppose the data is
data <- head(iris)
How can I create a new column whose values are derived from data$Sepal.Length, such that if data$Sepal.Length is greater than or equal to 5 the value is 5, if it is less than or equal to 3 the value is 3, and otherwise the value stays the same?
I have tried
data %>% mutate(Sepal.Length = case_when(Sepal.Length <=3 ~ '3',Sepal.Length>=5 ~ '5'))
But it gives NA for the remaining values.
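You can do this using a basic case_when statement. case_when returns NA for rows that match none of the conditions, so add a catch-all TRUE ~ Sepal.Length branch, and keep the replacement values numeric (3 and 5, not '3' and '5') so that all branches have the same type: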
data %>%
mutate(Sepal.Length = case_when(
Sepal.Length <= 3 ~ 3,
Sepal.Length >= 5 ~ 5,
TRUE ~ Sepal.Length))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5 3.6 1.4 0.2 setosa
# 6 5 3.9 1.7 0.4 setosa
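Since the rule is simply "clamp the values to the range 3 to 5", a base R alternative is to combine pmax and pmin. This is a small sketch, not part of the original answer; the name clamped is only for illustration.

```r
data <- head(iris)

# pmax() raises anything below 3 up to 3, pmin() lowers anything above 5 down to 5
clamped <- pmin(pmax(data$Sepal.Length, 3), 5)
clamped

data$Sepal.Length <- clamped
```

Values already between 3 and 5 pass through unchanged, so no catch-all branch is needed.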
I have several dataframes for which I need to fix the classes of multiple columns before I can proceed. Because the dataframes all have the same variables but the classes seem to differ from one dataframe to another, I figured I would use a for loop and use the number of unique values to decide whether a column should be coded as factor or numeric.
I tried the following for factor:
dataframes <- list(dataframe1, dataframe2, dataframe2, dataframe3)
for (i in dataframes){
cols.to.factor <-sapply(i, function(col) length(unique(col)) < 6)
i[cols.to.factor] <- apply(i[cols.to.factor] , factor)
}
Now the code runs, but it doesn't change anything. What am I missing?
Thanks for the help in advance!
The instruction
for(i in dataframes)
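extracts a copy of each element of the list dataframes into i. The loop body changes that copy, which is never assigned back to the original list. One way to correct the problem is to loop over the indices and write each modified data frame back (note also the use of lapply instead of apply to convert each selected column):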
for (i in seq_along(dataframes)){
x <- dataframes[[i]]
cols.to.factor <-sapply(x, function(col) length(unique(col)) < 6)
x[cols.to.factor] <- lapply(x[cols.to.factor] , factor)
dataframes[[i]] <- x
}
An equivalent lapply based solution is
dataframes <- lapply(dataframes, \(x){
cols.to.factor <- sapply(x, function(col) length(unique(col)) < 6)
x[cols.to.factor] <- lapply(x[cols.to.factor], factor)
x
})
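The copy semantics can be checked on a toy list. This is a minimal sketch with a made-up data frame; dfs and its columns are hypothetical.

```r
# A toy list with one made-up data frame
dfs <- list(data.frame(a = c(1, 2, 1), b = c(0.1, 0.2, 0.3)))

# Looping over the elements modifies only the loop variable d
for (d in dfs) {
  d$a <- factor(d$a)
}
before <- class(dfs[[1]]$a)

# Looping over the indices and assigning back modifies the list itself
for (k in seq_along(dfs)) {
  dfs[[k]]$a <- factor(dfs[[k]]$a)
}
after <- class(dfs[[1]]$a)
```

After the first loop the column is still numeric; only after the second loop is it a factor.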
library(tidyverse)
# example data
list(
iris,
iris %>% mutate(Sepal.Length = Sepal.Length %>% as.character())
) %>%
# unify column classes
map(~ .x %>% mutate(across(everything(), as.character))) %>%
# optional joining if wished
bind_rows() %>%
mutate(Species = Species %>% as.factor()) %>%
as_tibble()
#> # A tibble: 300 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <chr> <chr> <chr> <chr> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # … with 290 more rows
Created on 2021-10-05 by the reprex package (v2.0.1)
library(tidyverse)
df <- iris %>%
group_by(Species) %>%
mutate(Petal.Dim = Petal.Length * Petal.Width,
rank = rank(desc(Petal.Dim))) %>%
mutate(new_col = rank == 4, Sepal.Width)
table <- df %>%
filter(rank == 4) %>%
select(Species, new_col = Sepal.Width)
correct_df <- left_join(df, table, by = "Species")
df
#> # A tibble: 150 x 8
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Dim
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 5.1 3.5 1.4 0.2 setosa 0.280
#> 2 4.9 3 1.4 0.2 setosa 0.280
#> 3 4.7 3.2 1.3 0.2 setosa 0.26
#> 4 4.6 3.1 1.5 0.2 setosa 0.3
#> 5 5 3.6 1.4 0.2 setosa 0.280
#> 6 5.4 3.9 1.7 0.4 setosa 0.68
#> 7 4.6 3.4 1.4 0.3 setosa 0.42
#> 8 5 3.4 1.5 0.2 setosa 0.3
#> 9 4.4 2.9 1.4 0.2 setosa 0.280
#> 10 4.9 3.1 1.5 0.1 setosa 0.15
#> # ... with 140 more rows, and 2 more variables: rank <dbl>, new_col <lgl>
I'm basically looking for new_col to show the value that corresponds with rank = 4 from the Sepal.Width column. In this case, those values would be 3.9, 3.3, and 3.8. I'm envisioning this similar to a VLookup, or Index/Match in Excel.
Whenever I think "now I need to use VLOOKUP like I did in the past in Excel", I find the left_join() function helpful. It's part of the dplyr package. Instead of "looking up" values from one table in another table, it's easier in R to build one bigger table: one table remains unchanged (the "left" one, i.e. the first argument you pass to the function) and the other is added to it, using a column or columns they have in common as the index.
In your specific example, I can't entirely understand what you want new_col to have in it. If you want to do Excel-style VLOOKUP in R, then left_join() is the best starting point.
The question is not clear since it does not mention the purpose of a Vlookup or Index/Match like operation from Excel.
Also, you don't mention what value new_col should have if rank is not equal to 4.
Assuming the value is NA, the below solution with a simple ifelse would work:
df <- iris %>%
group_by(Species) %>%
mutate(Petal.Dim = Petal.Length * Petal.Width,
rank = rank(desc(Petal.Dim))) %>%
ungroup() %>%
mutate(new_col = ifelse(rank == 4, Sepal.Width,NA))
df
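If the goal is instead to repeat the rank-4 value on every row of its group (closer to a VLOOKUP), one option is to index within the grouped mutate. Below is a sketch on made-up data; toy, grp, score and value are hypothetical names. match(4, rank) returns the position of the first row with rank 4, or NA if ties leave no row with exactly that rank.

```r
library(dplyr)

# Made-up data: two groups, no ties in score
toy <- data.frame(
  grp   = rep(c("a", "b"), each = 5),
  score = c(10, 50, 30, 20, 40, 5, 4, 3, 2, 1),
  value = c(1.1, 1.2, 1.3, 1.4, 1.5, 2.1, 2.2, 2.3, 2.4, 2.5)
)

res <- toy %>%
  group_by(grp) %>%
  mutate(rank = rank(desc(score)),
         # value of the row ranked 4th, recycled over the whole group
         new_col = value[match(4, rank)]) %>%
  ungroup()

res
```

In this toy data new_col is 1.4 for every row of group a and 2.4 for every row of group b.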
I am trying to use a combination of mutate_at and which.max to manipulate a data frame as outlined below.
#This is basically what I want to achieve
df_want <- iris %>% group_by(Species) %>% mutate(Sepal.Length = Sepal.Length[which.max(Petal.Width)],
Sepal.Width = Sepal.Width[which.max(Petal.Width)])
#Here is my attempt at a smarter solution, but it does not work
df_attempt <- iris %>% group_by(Species) %>% mutate_at(c("Sepal.Length", "Sepal.Width"), function(x) x[which.max("Petal.Width")])
#However, this works
df_test <- iris %>% group_by(Species) %>% mutate_at(c("Sepal.Length", "Sepal.Width"), function(x) x + 100)
The code to produce df_attempt does not work. I get the following error message:
Error in mutate_impl(.data, dots) :
Column `Sepal.Length` must be length 50 (the group size) or one, not 0
Any ideas how I can get around this while still using mutate_at?
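Your attempt fails because which.max("Petal.Width") is applied to the literal string "Petal.Width" rather than to the column, so it returns an empty index and x[integer(0)] has length 0, hence the error. The standard dplyr way is to refer to the column itself: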
df_want <- iris %>%
group_by(Species) %>%
mutate(Sepal.Length = Sepal.Length[which.max(Petal.Width)],
Sepal.Width = Sepal.Width[which.max(Petal.Width)])
df_attempt <- iris %>%
group_by(Species) %>%
mutate_at(vars(Sepal.Length, Sepal.Width), funs(.[which.max(Petal.Width)]))
Result:
# A tibble: 150 x 5
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fctr>
1 5 3.5 1.4 0.2 setosa
2 5 3.5 1.4 0.2 setosa
3 5 3.5 1.3 0.2 setosa
4 5 3.5 1.5 0.2 setosa
5 5 3.5 1.4 0.2 setosa
6 5 3.5 1.7 0.4 setosa
7 5 3.5 1.4 0.3 setosa
8 5 3.5 1.5 0.2 setosa
9 5 3.5 1.4 0.2 setosa
10 5 3.5 1.5 0.1 setosa
# ... with 140 more rows
> identical(df_want, df_attempt)
[1] TRUE
Note:
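With vars you can reference variables by their unquoted names (NSE).
With funs, the . stands for the current column, so funs(.[which.max(Petal.Width)]) plays the role of function(x) x[which.max(Petal.Width)], while Petal.Width still refers to the data column.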
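In current dplyr, funs() is deprecated (since 0.8.0) and mutate_at() is superseded by across() (since 1.0.0). A hedged sketch of the same operation with across(), assuming dplyr 1.0.0 or later, compared against the explicit version from the question:

```r
library(dplyr)

# Explicit version, used as the reference result
df_want <- iris %>%
  group_by(Species) %>%
  mutate(Sepal.Length = Sepal.Length[which.max(Petal.Width)],
         Sepal.Width  = Sepal.Width[which.max(Petal.Width)])

# Same operation with across(): .x is the current column,
# Petal.Width is looked up in the (grouped) data
df_across <- iris %>%
  group_by(Species) %>%
  mutate(across(c(Sepal.Length, Sepal.Width),
                ~ .x[which.max(Petal.Width)]))

all.equal(df_across$Sepal.Length, df_want$Sepal.Length)
all.equal(df_across$Sepal.Width, df_want$Sepal.Width)
```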
library(tidyverse)
iris <- iris
means <- iris %>%
group_by(Species) %>%
summarise_all(funs(mean))
sd <- iris %>%
group_by(Species) %>%
summarise_all(funs(sd))
bottom <- means[ ,2:5] - sd[ ,2:5]
bottom$Species <- c("setosa", "versicolor", "virginica")
print(bottom)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.653510 3.048936 1.288336 0.1406144 setosa
2 5.419829 2.456202 3.790089 1.1282473 versicolor
3 5.952120 2.651503 5.000105 1.7513499 virginica
top <- means[ ,2:5] + sd[ ,2:5]
top$Species <- c("setosa", "versicolor", "virginica")
print(top)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.358490 3.807064 1.635664 0.3513856 setosa
2 6.452171 3.083798 4.729911 1.5237527 versicolor
3 7.223880 3.296497 6.103895 2.3006501 virginica
How do I get the rows of iris where the values for Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width all fall between the values in the top and bottom data frames?
For example, I only want setosa rows where Sepal.Length > 4.65 & Sepal.Length < 5.35 and Sepal.Width is between 3.04 and 3.80, etc. Ideally the end result contains only the 4 numeric columns and the species column.
Thanks.
It would be much easier if you can filter from the beginning without the summarize step:
iris %>%
group_by(Species) %>%
filter_if(is.numeric, all_vars(. < mean(.) + sd(.) & . > mean(.) - sd(.)))
# A tibble: 54 x 5
# Groups: Species [3]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fctr>
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.7 3.2 1.3 0.2 setosa
# 3 5.0 3.6 1.4 0.2 setosa
# 4 5.0 3.4 1.5 0.2 setosa
# 5 4.8 3.4 1.6 0.2 setosa
# 6 5.1 3.5 1.4 0.3 setosa
# 7 5.1 3.8 1.5 0.3 setosa
# 8 5.2 3.5 1.5 0.2 setosa
# 9 5.2 3.4 1.4 0.2 setosa
#10 4.7 3.2 1.6 0.2 setosa
# ... with 44 more rows
I'm not sure whether you can avoid the summarize step in your real workflow, so I'm posting this as an option.
Or use between:
iris %>%
group_by(Species) %>%
filter_if(is.numeric, all_vars(between(., mean(.) - sd(.), mean(.) + sd(.))))
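filter_if() and all_vars() are superseded in recent dplyr. A sketch of the same filter with if_all(), assuming dplyr 1.0.4 or later; note that between() includes the bounds, whereas the first version above uses strict inequalities.

```r
library(dplyr)

within_1sd <- iris %>%
  group_by(Species) %>%
  # Keep rows where every numeric column lies within one standard
  # deviation of its per-species mean
  filter(if_all(where(is.numeric),
                ~ between(.x, mean(.x) - sd(.x), mean(.x) + sd(.x)))) %>%
  ungroup()

head(within_1sd)
```

Because the data are grouped, mean() and sd() are computed per species, and the grouping column is automatically excluded from the where(is.numeric) selection.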
Here is a solution using non-equi joins, which builds on the (now deleted) approach of @Frank:
library(data.table)
# add a row number column and reshape from wide to long
DT <- melt(data.table(iris)[, rn := .I], id = c("rn", "Species"))
# compute lower and upper bound for each variable and Species
mDT <- DT[, .(lb = lb <- mean(value) - (s <- sd(value)),
ub = lb + 2 * s), by = .(Species, variable)]
# find row numbers of items which fulfill conditions
selected_rn <-
# non-equi join
DT[DT[mDT, on = .(Species, variable, value > lb, value < ub), which = TRUE]][
# all uniqueN(mDT$variable) variables must have been selected
# for an item to pass (thanks to @Frank for tip to avoid hardcoded value)
, .N, by = rn][N == uniqueN(mDT$variable), rn]
head(iris[sort(selected_rn),])
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
8 5.0 3.4 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
18 5.1 3.5 1.4 0.3 setosa