How to mutate a column with if/then in an R data.frame

Let's suppose the data is
data <- head(iris)
How can I create a new column whose values are derived from data$Sepal.Length such that if data$Sepal.Length is greater than or equal to 5 the value is 5, if it is less than or equal to 3 the value is 3, and otherwise the value stays the same?
I have tried
data %>% mutate(Sepal.Length = case_when(Sepal.Length <=3 ~ '3',Sepal.Length>=5 ~ '5'))
But it gives NA for the remaining values.

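case_when() returns NA for any row that matches none of the conditions, so add a TRUE ~ fallback that keeps the original value. Use numbers rather than strings on the right-hand side so every branch has the same type: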
data %>%
mutate(Sepal.Length = case_when(
Sepal.Length <= 3 ~ 3,
Sepal.Length >= 5 ~ 5,
TRUE ~ Sepal.Length))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5 3.6 1.4 0.2 setosa
# 6 5 3.9 1.7 0.4 setosa
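Because the rule in the question is a plain clamp to the range 3 to 5, the same result can also be sketched in base R with pmin() and pmax(). This is an alternative to the case_when answer, not a replacement for it, and it assumes the column is numeric.

```r
data <- head(iris)

# pmin() caps values at 5, pmax() then raises values below 3 up to 3;
# everything in between is left unchanged
data$Sepal.Length <- pmax(pmin(data$Sepal.Length, 5), 3)

data$Sepal.Length
```

Unlike case_when without a fallback, this cannot produce NA for values that fall between the two limits.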

Related

Succinct subsetting across multiple columns in R

Say I have a massive data frame in which multiple columns contain codes drawn from a very large set of unique values, and I want to use these codes to select rows and subset the original data frame. There are around 1000 codes, and the ones I want are consecutive. For example, about 30 columns contain codes, and I only want rows that have codes 100 to 120 in ANY of these columns.
There's a long way to do this, which is something like
new_dat <- df[which(df$codes==100 | df$codes==101 | df$codes1==100
and I repeat this for every possible code in every one of the columns that can contain these codes. Is there a more convenient way to do this?
I want to try solving this with dplyr's select function, but I'm having trouble seeing whether it works for my case out of the box.
Take the iris dataset.
Say I wanted all rows that contain a value between 4.0 and 5.0 in any column whose name contains the word Sepal.
#this only goes for 4.0
brand_new_df <- select(filter(iris, Sepal.Length ==4.0 | Sepal.Width == 4.0))
but what I want is something like
brand_new_df <- select(filter(iris, contains(Sepal) == 4.0:5.0))
Is there a dplyr way to do this?
A corresponding across() version from @RonakShah's answer:
library(dplyr)
iris %>% filter(rowSums(across(contains('Sepal'), ~ between(., 4, 5))) > 0)
or
iris %>% filter(rowSums(across(contains('Sepal'), between, 4, 5)) > 0)
From vignette("colwise"):
Previously, filter() was paired with the all_vars() and any_vars() helpers. Now, across() is equivalent to all_vars(), and there’s no direct replacement for any_vars().
So you need something like rowSums(...) > 0 to achieve the effect of any_vars().
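The vignette text quoted above reflects the dplyr release current when this answer was written. To my knowledge, later releases (dplyr 1.0.4 and newer) added if_any() and if_all() for exactly this purpose. A minimal sketch, assuming such a version is installed:

```r
library(dplyr)

# Keep rows where ANY column whose name contains "Sepal" lies in [4, 5]
res <- iris %>%
  filter(if_any(contains("Sepal"), ~ between(.x, 4, 5)))

head(res)
```

if_all() is the counterpart for requiring the condition in every selected column.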
You can use filter_at :
library(dplyr)
iris %>% filter_at(vars(contains('Sepal')), any_vars(between(., 4, 5)))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 4.9 3.0 1.4 0.2 setosa
#2 4.7 3.2 1.3 0.2 setosa
#3 4.6 3.1 1.5 0.2 setosa
#4 5.0 3.6 1.4 0.2 setosa
#5 4.6 3.4 1.4 0.3 setosa
#6 5.0 3.4 1.5 0.2 setosa
#7 4.4 2.9 1.4 0.2 setosa
#....
Base R:
# Data: 80 identical "codes" columns built from one random vector
tmp <- data.frame(x1 = rnorm(999, mean = 100, sd = 2))
df <-
setNames(data.frame(tmp[rep(1, each = 80)]), paste0("codes", 1:80))
df2 <- cbind(id = 1:nrow(df), df)
# Subset: keep rows where ANY codes column lies between 100 and 120
# (use == length(cols) instead of > 0 to require ALL columns)
cols <- grep("codes", names(df2), value = TRUE)
df2[rowSums(sapply(cols,
function(x) {
df2[, x] >= 100 & df2[, x] <= 120
})) > 0, ]
One option could be:
iris %>%
filter(Reduce(`|`, across(contains("Sepal"), ~ between(.x, 4, 5))))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.9 3.0 1.4 0.2 1
2 4.7 3.2 1.3 0.2 1
3 4.6 3.1 1.5 0.2 1
4 5.0 3.6 1.4 0.2 1
5 4.6 3.4 1.4 0.3 1
6 5.0 3.4 1.5 0.2 1
7 4.4 2.9 1.4 0.2 1
8 4.9 3.1 1.5 0.1 1
9 4.8 3.4 1.6 0.2 1
10 4.8 3.0 1.4 0.1 1
library(dplyr)
df <- iris
# value to look for
val <- 4
# find columns
cols <- which(colSums(df == val , na.rm = TRUE) > 0L)
# filter rows
iris %>% filter_at(cols, any_vars(.==val))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.8 4.0 1.2 0.2 setosa
2 5.5 2.3 4.0 1.3 versicolor
3 6.0 2.2 4.0 1.0 versicolor
4 6.1 2.8 4.0 1.3 versicolor
5 5.5 2.5 4.0 1.3 versicolor
6 5.8 2.6 4.0 1.2 versicolor

which.max not functioning as expected

I am trying to create a table that includes the value of y for when x is equal to or less than a certain value, by group. Below is my code using the iris data set.
For "<=2.5", I expect to get 4.5, 5.0, or 5.8 for the virginica group, since these are the values of Petal.Length associated with a Sepal.Width of 2.5 for virginica. But instead, I get 6.0. Any ideas of where I went wrong? (My actual data set does not have duplicates of the variable analogous to Sepal.Width for the same group, so choosing among those is not an issue for me.)
data(iris)
my.table <- iris %>%
group_by(Species) %>%
summarise("<=2.5" = Petal.Length[which.max(Sepal.Width[Sepal.Width<=2.5])],
"<=3" = Petal.Length[which.max(Sepal.Width[Sepal.Width<=3])],
"<=3.5" = Petal.Length[which.max(Sepal.Width[Sepal.Width<=3.5])],
"<=4" = Petal.Length[which.max(Sepal.Width[Sepal.Width<=4])])
This is related to the question Create a table with values from ecdf graph
The problem is that you are first subsetting the Sepal.Width. Consequently, the index returned by which.max applies to that sub-vector, and no longer corresponds to the indices of the whole Petal.Length vector.
To fix this, you also need to subset Petal.Length correspondingly, e.g.
…
`<=2.5` = Petal.Length[Sepal.Width <= 2.5][which.max(Sepal.Width[Sepal.Width <= 2.5])],
…
… of course this gets rather verbose and repetitive. It might be better to perform the subsetting in a separate step. However, this means creating new columns for every threshold value.
Incidentally, this is unrelated to dplyr.
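The index mismatch described above can be reproduced on two tiny vectors. The names x, y and keep are made up for illustration; x plays the role of Sepal.Width and y the role of Petal.Length.

```r
x <- c(10, 1, 7, 2)
y <- c("a", "b", "c", "d")
keep <- x <= 5

# which.max() sees only the subset c(1, 2), so it returns position 2
idx <- which.max(x[keep])

# Wrong: position 2 of the FULL y vector
wrong <- y[idx]

# Right: subset y the same way before indexing
right <- y[keep][idx]

c(wrong = wrong, right = right)
```

Here the largest x not exceeding 5 is 2, whose partner in y is "d", which is what the second form returns.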
To make it more scalable, use a double loop:
myCuts <- c(2.5, 3, 3.5, 4)
res <- sapply(split(iris, iris$Species), function(i)
sapply(myCuts, function(j){
x <- i[ i$Sepal.Width <= j, ]
x$Petal.Length[ which.max(x$Sepal.Width) ]
}))
rownames(res) <- paste0("<=", myCuts)
res
# setosa versicolor virginica
# <=2.5 1.3 3.9 4.5
# <=3 1.4 4.2 5.9
# <=3.5 1.4 4.5 5.6
# <=4 1.2 4.5 6.7
Here's another way to get similar data. Create a group variable according to Sepal.Width values. Then, within each group, select one row by Sepal.Width. Note that top_n(1, -Sepal.Width) as written keeps the row(s) with the smallest Sepal.Width in each bin; drop the minus sign to keep the largest. The result is in a different "shape", but you can always pivot_wider if you want all the values as columns instead of rows.
iris %>%
group_by(Species,
Sepal.Width_grp = case_when(Sepal.Width <= 2.5 ~ '<=2.5',
Sepal.Width <= 3 ~ '<=3',
Sepal.Width <= 3.5 ~ '<=3.5',
Sepal.Width <= 4 ~ '<=4',
TRUE ~ '> 4')) %>%
top_n(1, -Sepal.Width) %>%
select(Species, Sepal.Width_grp, Top.Sepal.Width = Sepal.Width, Petal.Width)
# # A tibble: 25 x 4
# # Groups: Species, Sepal.Width_grp [12]
# Species Sepal.Width_grp Top.Sepal.Width Petal.Width
# <fct> <chr> <dbl> <dbl>
# 1 setosa <=3.5 3.1 0.2
# 2 setosa <=4 3.6 0.2
# 3 setosa <=3 2.9 0.2
# 4 setosa <=3.5 3.1 0.1
# 5 setosa <=4 3.6 0.2
# 6 setosa <=3.5 3.1 0.2
# 7 setosa > 4 4.1 0.1
# 8 setosa <=3.5 3.1 0.2
# 9 setosa <=4 3.6 0.1
# 10 setosa <=2.5 2.3 0.3
# # ... with 15 more rows
Edit: A little simpler if you use cut
iris %>%
group_by(Species,
Sepal.Width_grp = cut(Sepal.Width, c(0, 2.5, 3, 3.5, 4, Inf))) %>%
top_n(1, -Sepal.Width) %>%
select(Species, Sepal.Width_grp, Top.Sepal.Width = Sepal.Width, Petal.Width)
# # A tibble: 25 x 4
# # Groups: Species, Sepal.Width_grp [12]
# Species Sepal.Width_grp Top.Sepal.Width Petal.Width
# <fct> <fct> <dbl> <dbl>
# 1 setosa (3,3.5] 3.1 0.2
# 2 setosa (3.5,4] 3.6 0.2
# 3 setosa (2.5,3] 2.9 0.2
# 4 setosa (3,3.5] 3.1 0.1
# 5 setosa (3.5,4] 3.6 0.2
# 6 setosa (3,3.5] 3.1 0.2
# 7 setosa (4,Inf] 4.1 0.1
# 8 setosa (3,3.5] 3.1 0.2
# 9 setosa (3.5,4] 3.6 0.1
# 10 setosa (0,2.5] 2.3 0.3
# # ... with 15 more rows

dplyr nested ifelse errors - is it vector recycling?

I can write this code that adds two columns to the iris data set. The first added column is a sum of the first four columns. The second added column is my attempt at "programming".
iris.size <- iris %>%
mutate(Total =
apply(.[(1:4)], 1, sum)
) %>%
mutate(Size =
ifelse(
apply(.[(1:4)], 1, sum) != 0 &
.[2] > .[3], "Output1",
ifelse(
apply(.[(1:4)], 1, sum) == 0 &
.[2] > .[3], "Output2",
"Output3")
)
)
You'll notice this code does not throw any errors and it does output what I want it to output. But watch what happens when I try my next step in analysis.
iris.size %>% arrange(Size)
Error: Column Size must be a 1d atomic vector or a list
It must be my ifelse logic, correct? The ifelse logic seems straightforward: if condition 1 then output1, otherwise if condition 2 then output2, otherwise output3.
I ended up forcing iris.size$Size into a vector using as.vector but I'd like to know where my logic went wrong in the first place so I don't have to resort to using band aids in the future. After some googling it sounds like if statements are preferred over ifelse statements in R, but if statements only seem to work on single logical values, not vectors.
When you run your code, you get this output as iris.size:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Total Sepal.Width
1 5.1 3.5 1.4 0.2 setosa 10.2 Output1
2 4.9 3.0 1.4 0.2 setosa 9.5 Output1
3 4.7 3.2 1.3 0.2 setosa 9.4 Output1
4 4.6 3.1 1.5 0.2 setosa 9.4 Output1
5 5.0 3.6 1.4 0.2 setosa 10.2 Output1
6 5.4 3.9 1.7 0.4 setosa 11.4 Output1
Size is not displayed because no column named Size was created. That happens because .[2] > .[3] compares two objects of class data.frame, not two vectors, which is what .[, 2] > .[, 3] would compare.
I'm still trying to understand what is being created. What is that Sepal.Width column?
Adjust yours with the following:
iris.size <- iris %>% mutate(Total =
apply(.[(1:4)], 1, sum) ) %>% mutate(Size =
ifelse(
apply(.[(1:4)], 1, sum) != 0 &
.[,2] > .[,3], "Output1",
ifelse(
apply(.[(1:4)], 1, sum) == 0 &
.[,2] > .[,3], "Output2",
"Output3")
) )
iris.size
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Total Size
1 5.1 3.5 1.4 0.2 setosa 10.2 Output1
2 4.9 3.0 1.4 0.2 setosa 9.5 Output1
3 4.7 3.2 1.3 0.2 setosa 9.4 Output1
4 4.6 3.1 1.5 0.2 setosa 9.4 Output1
5 5.0 3.6 1.4 0.2 setosa 10.2 Output1
6 5.4 3.9 1.7 0.4 setosa 11.4 Output1
Suggestion:
Here's a condensed version of your code, if you're interested. Columns 2 and 3 are Sepal.Width and Petal.Length, so you can replace those names with .[,2] and .[,3] if need be.
iris.size <- iris %>%
mutate(Total = rowSums(.[,sapply(., is.numeric)]),
Size = ifelse(Total != 0 & Sepal.Width > Petal.Length, "Output1",
ifelse(Total == 0 & Sepal.Width > Petal.Length, "Output2", "Output3")))%>%
arrange(Size)
iris.size
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Total Size
1 5.1 3.5 1.4 0.2 setosa 10.2 Output1
2 4.9 3.0 1.4 0.2 setosa 9.5 Output1
3 4.7 3.2 1.3 0.2 setosa 9.4 Output1
4 4.6 3.1 1.5 0.2 setosa 9.4 Output1
5 5.0 3.6 1.4 0.2 setosa 10.2 Output1
6 5.4 3.9 1.7 0.4 setosa 11.4 Output1
Making use of rowwise and splitting things up a bit for readability...
iris.size <- iris %>%
mutate(Total =
apply(.[(1:4)], 1, sum)
)
iris.size <-iris.size %>% rowwise %>% mutate(Size =
if(
Total != 0 && Sepal.Width > Petal.Length) {
"Output1"
} else {
if(Total == 0 && Sepal.Width > Petal.Length){
"Output2"
} else {
"Output3"}}
)
class(iris.size$Size)
[1] "character"
> iris.size %>% arrange(Size)
# A tibble: 150 x 7
Sepal.Length Sepal.Width Petal.Length Petal.Width
<dbl> <dbl> <dbl> <dbl>
1 5.1 3.5 1.4 0.2
2 4.9 3.0 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
5 5.0 3.6 1.4 0.2
6 5.4 3.9 1.7 0.4
7 4.6 3.4 1.4 0.3
8 5.0 3.4 1.5 0.2
9 4.4 2.9 1.4 0.2
10 4.9 3.1 1.5 0.1
# ... with 140 more rows, and 3 more variables:
# Species <fctr>, Total <dbl>, Size <chr>
The error message arises because the Size column is not a plain vector: it is a one-column character matrix stored inside the data frame. This can be confirmed with the str() function:
> str(iris.size["Size"])
'data.frame': 150 obs. of 1 variable:
$ Size: chr [1:150, 1] "Output1" "Output1" "Output1" "Output1" ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "Sepal.Width"
Casting the column with as.vector() resolves the problem because it drops the matrix dimensions, leaving an ordinary character vector.
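The difference between the two indexing forms can be checked directly in base R, without dplyr. The sketch below uses iris itself in place of the . placeholder; size_bad and size_ok are illustrative names.

```r
# Single-bracket indexing without a comma keeps the data.frame class,
# and comparing two data frames yields a logical matrix
cmp <- iris[2] > iris[3]
is.matrix(cmp)
dim(cmp)
colnames(cmp)

# ifelse() returns a result shaped like its test, so this is a matrix too
size_bad <- ifelse(cmp, "Output1", "Output3")
is.matrix(size_bad)

# With the comma, both sides are plain vectors and so is the result
size_ok <- ifelse(iris[, 2] > iris[, 3], "Output1", "Output3")
is.matrix(size_ok)
```

The matrix keeps the column name of the left-hand operand, which is where the stray Sepal.Width header in the printed output comes from.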

Use column names from vector in for loop in dplyr

This should probably be quite straightforward, but I am struggling to get it to work. I currently have a vector of column names:
columns <- c('product1', 'product2', 'product3', 'support4')
I now want to use dplyr in a for loop to mutate some columns, but I am struggling to make it recognize that it is a column name, not a variable.
for (col in columns) {
cross.sell.val <- cross.sell.val %>%
dplyr::mutate(col = ifelse(col == 6, 6, col)) %>%
dplyr::mutate(col = ifelse(col == 5, 6, col))
}
Can I use %>% in these situations? Thanks..
You should be able to do this without using a for loop at all.
Because you didn't provide any data, I am going to use the builtin iris dataset. The top of it looks like:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
First, I am saving the columns to analyze:
columns <- names(iris)[1:4]
Then, use mutate_at for each column, along with that particular rule. In each, the . represents the vector for each column. Your example implies that the rules are the same for each column, though if that is not the case, you may need more flexibility here.
mod_iris <-
iris %>%
mutate_at(columns, funs(ifelse(. > 5, 6, .))) %>%
mutate_at(columns, funs(ifelse(. < 1, 1, .)))
returns:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 6.0 3.5 1.4 1 setosa
2 4.9 3.0 1.4 1 setosa
3 4.7 3.2 1.3 1 setosa
4 4.6 3.1 1.5 1 setosa
5 5.0 3.6 1.4 1 setosa
6 6.0 3.9 1.7 1 setosa
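mutate_at() and funs() come from older dplyr releases; as far as I know, funs() has been deprecated since dplyr 0.8.0 and across() is the current idiom. A sketch of the same two rules, assuming dplyr 1.0.0 or later:

```r
library(dplyr)

columns <- names(iris)[1:4]

mod_iris <- iris %>%
  # all_of() selects the columns named in the character vector
  mutate(across(all_of(columns), ~ ifelse(.x > 5, 6, .x))) %>%
  mutate(across(all_of(columns), ~ ifelse(.x < 1, 1, .x)))

head(mod_iris)
```

Here .x plays the role that . plays inside funs().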
If you wanted to, you could instead write a function that makes all of your changes to a column. This could also allow you to set the cutoffs differently for each column. For example, you may want to set the bottom and top portions of the data equal to a threshold (to rein in outliers for some reason), or you may know that each variable uses a dummy value as a placeholder (and that value differs by column, but is always the most common value). You could easily add any arbitrary rule of interest this way, and it gives you a bit more flexibility than chaining together separate rules (e.g., if you use the mean, the mean changes when you change some of the values).
An example function:
modColumns <- function(x){
botThresh <- quantile(x, 0.25)
topThresh <- quantile(x, 0.75)
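# note: sort() is ascending, so [1] picks the LEAST frequent value;
# use sort(table(x), decreasing = TRUE) to get the most common one
dummyVal <- as.numeric(names(sort(table(x)))[1])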
dummyReplace <- NA
x <- ifelse(x < botThresh, botThresh, x)
x <- ifelse(x > topThresh, topThresh, x)
x <- ifelse(x == dummyVal, dummyReplace, x)
return(x)
}
And in use:
iris %>%
mutate_at(columns, modColumns) %>%
head
returns:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.3 1.6 0.3 setosa
2 5.1 3.0 1.6 0.3 setosa
3 5.1 3.2 1.6 0.3 setosa
4 5.1 3.1 1.6 0.3 setosa
5 5.1 3.3 1.6 0.3 setosa
6 5.4 3.3 1.7 0.4 setosa
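If the for loop from the question is still wanted, the column name held in col can be passed to mutate() through the .data pronoun and the := operator. This is only a sketch: the small cross.sell.val data frame below is a made-up stand-in, since the original data was not posted, and only the 5-to-6 recoding from the question is shown.

```r
library(dplyr)

# Hypothetical stand-in for the asker's data
cross.sell.val <- data.frame(
  product1 = c(1, 5, 6),
  product2 = c(5, 5, 2),
  product3 = c(6, 3, 5),
  support4 = c(4, 5, 1)
)
columns <- c("product1", "product2", "product3", "support4")

# .data[[col]] reads the column named by the string in col;
# !!col := writes to a column with that same name
for (col in columns) {
  cross.sell.val <- cross.sell.val %>%
    dplyr::mutate(!!col := ifelse(.data[[col]] == 5, 6, .data[[col]]))
}

cross.sell.val
```

Writing col = ... inside mutate() creates a column literally called col, which is why the loop in the question did not touch the intended columns.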

Conditional non-equi join

library(tidyverse)
iris <- iris
means <- iris %>%
group_by(Species) %>%
summarise_all(funs(mean))
sd <- iris %>%
group_by(Species) %>%
summarise_all(funs(sd))
bottom <- means[ ,2:5] - sd[ ,2:5]
bottom$Species <- c("setosa", "versicolor", "virginica")
print(bottom)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.653510 3.048936 1.288336 0.1406144 setosa
2 5.419829 2.456202 3.790089 1.1282473 versicolor
3 5.952120 2.651503 5.000105 1.7513499 virginica
top <- means[ ,2:5] + sd[ ,2:5]
top$Species <- c("setosa", "versicolor", "virginica")
print(top)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.358490 3.807064 1.635664 0.3513856 setosa
2 6.452171 3.083798 4.729911 1.5237527 versicolor
3 7.223880 3.296497 6.103895 2.3006501 virginica
How do I get the rows of Iris where the values for Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width all fall between the values in the top and bottom data frames?
For example, I only want setosa rows where Sepal.Length > 4.65 & Sepal.Length < 5.35 and Sepal.Width is between 3.04 and 3.80, etc. Ideally the end result contains only the 4 numeric columns and the species column.
Thanks.
It would be much easier if you can filter from the beginning without the summarize step:
iris %>%
group_by(Species) %>%
filter_if(is.numeric, all_vars(. < mean(.) + sd(.) & . > mean(.) - sd(.)))
# A tibble: 54 x 5
# Groups: Species [3]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fctr>
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.7 3.2 1.3 0.2 setosa
# 3 5.0 3.6 1.4 0.2 setosa
# 4 5.0 3.4 1.5 0.2 setosa
# 5 4.8 3.4 1.6 0.2 setosa
# 6 5.1 3.5 1.4 0.3 setosa
# 7 5.1 3.8 1.5 0.3 setosa
# 8 5.2 3.5 1.5 0.2 setosa
# 9 5.2 3.4 1.4 0.2 setosa
#10 4.7 3.2 1.6 0.2 setosa
# ... with 44 more rows
I'm not sure whether you can avoid the summarize step in your case, so I'm posting this as an option.
Or use between:
iris %>%
group_by(Species) %>%
filter_if(is.numeric, all_vars(between(., mean(.) - sd(.), mean(.) + sd(.))))
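filter_if() and all_vars() belong to the older scoped-verb family. A sketch of the same filter with if_all(), which I believe is available from dplyr 1.0.4 onward (the version is an assumption about the reader's setup):

```r
library(dplyr)

res <- iris %>%
  group_by(Species) %>%
  # mean() and sd() are evaluated within each Species group
  filter(if_all(where(is.numeric),
                ~ between(.x, mean(.x) - sd(.x), mean(.x) + sd(.x)))) %>%
  ungroup()

head(res)
```

between() is inclusive at both ends, whereas the first version above uses strict inequalities, so the two can differ for a value that sits exactly on a bound.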
Here is a solution using non-equi joins, building on the (now deleted) approach of @Frank:
library(data.table)
# add a row number column and reshape from wide to long
DT <- melt(data.table(iris)[, rn := .I], id = c("rn", "Species"))
# compute lower and upper bound for each variable and Species
mDT <- DT[, .(lb = lb <- mean(value) - (s <- sd(value)),
ub = lb + 2 * s), by = .(Species, variable)]
# find row numbers of items which fulfill conditions
selected_rn <-
# non-equi join
DT[DT[mDT, on = .(Species, variable, value > lb, value < ub), which = TRUE]][
# all uniqueN(mDT$variable) variables must have been selected
# for an item to pass (thanks to @Frank for tip to avoid hardcoded value)
, .N, by = rn][N == uniqueN(mDT$variable), rn]
head(iris[sort(selected_rn),])
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
8 5.0 3.4 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
18 5.1 3.5 1.4 0.3 setosa
