I'm trying to create a binary set of variables that uses data across multiple columns.
I have a dataset where I'm trying to create a binary variable where any column with a specific name will be indexed for a certain value. I'll use iris as an example dataset.
Let's say I want to create a variable where, for any column whose name contains the string "Sepal", rows holding the values 5.1, 3.0, or 4.7 in those columns become "Class A", while rows holding 3.1, 5.0, or 5.4 become "Class B". So let's look at the first few entries of iris:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
The first 3 rows should then be under "Class A", while rows 4-6 should be under "Class B". I tried writing this code to do that:
mutate(iris, Class = if_else(
vars(contains("Sepal")), any_vars(. %in% c(5.1,3.0, 4.7))), "Class A",
ifelse(vars(contains("Sepal")), any_vars(. %in% c(3.1, 5.0, 5.4))), "Class B",NA)
and received the error
Error: `condition` must be a logical vector, not a `quosures/list` object
So I've realized I need lapply here, but I'm not sure where to begin. I don't know how to express both the selection of columns with "Sepal" in the name and the specific values in those rows as a single list object that I can pass to lapply.
This is clearly the wrong syntax:
lapply(vars(contains("Sepal")), any_vars(. %in% c(5.1,3.0, 4.7)))
Examples using case_when will also be accepted as answers.
If you want to do this using dplyr, you can use rowwise() with the new c_across():
library(dplyr)
iris %>%
rowwise() %>%
mutate(Class = case_when(
any(c_across(contains("Sepal")) %in% c(5.1,3.0, 4.7)) ~ 'Class A',
any(c_across(contains("Sepal")) %in% c(3.1,5.0,5.4)) ~ 'Class B')) %>%
head
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Class
# <dbl> <dbl> <dbl> <dbl> <fct> <chr>
#1 5.1 3.5 1.4 0.2 setosa Class A
#2 4.9 3 1.4 0.2 setosa Class A
#3 4.7 3.2 1.3 0.2 setosa Class A
#4 4.6 3.1 1.5 0.2 setosa Class B
#5 5 3.6 1.4 0.2 setosa Class B
#6 5.4 3.9 1.7 0.4 setosa Class B
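However, note that using %in% on numeric values is not reliable: it tests for exact equality, and two floating-point values that print the same can still differ slightly. If interested, you may read "Why are these numbers not equal?"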
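As a rough sketch of a tolerance-based alternative to %in%, the base R code below flags a value as a match when it lies within a small distance of any target. The helper near_any() and the tolerance of 1e-8 are assumptions made for illustration, not part of the original answer.

```r
# Hypothetical helper: TRUE for each element of x that lies within
# `tol` of at least one value in `targets`
near_any <- function(x, targets, tol = 1e-8) {
  rowSums(abs(outer(x, targets, "-")) < tol) > 0
}

# Columns whose name contains "Sepal"
sepal <- iris[grep("Sepal", names(iris))]

# Combine the per-column matches row by row
is_a <- Reduce(`|`, lapply(sepal, near_any, targets = c(5.1, 3.0, 4.7)))
is_b <- Reduce(`|`, lapply(sepal, near_any, targets = c(3.1, 5.0, 5.4)))

# "Class A" takes precedence, as in the case_when() version
Class <- ifelse(is_a, "Class A", ifelse(is_b, "Class B", NA_character_))
head(Class)
```

Because the check is vectorised over whole columns, this version does not need rowwise().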
Related
Not sure why the first one has an error but the second line works? My understanding was using names(.) in the formulas tells R to use the data before pipe operator. It seems to work for .cols argument but not for formula.
iris%>%rename_with(~gsub("Petal","_",names(.)),all_of(names(.)))
iris%>%rename_with(~gsub("Petal","_",names(iris)),all_of(names(.)))
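rename_with applies a function to the names of the passed data frame. Inside the formula, . (or .x) is the character vector of column names, not the data frame, so names(.) does not give you the column names. The function should be one that, given the vector of names, returns the altered names, so the syntax is much simpler than you are trying to make it: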
iris %>%
rename_with(~ gsub("Petal", "_", .x))
#> Sepal.Length Sepal.Width _.Length _.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#... etc
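To see why names(.) fails inside the formula, here is a small check, offered as a sketch. It relies on the point above that the function receives the vector of column names, which is an unnamed character vector.

```r
library(dplyr)

# What the function receives: a plain, unnamed character vector
nm <- names(iris)
names(nm)                        # NULL, because the vector has no names

# gsub() on NULL returns an empty character vector, which cannot
# serve as a set of replacement column names
gsub("Petal", "_", names(nm))

# Using the argument directly gives one new name per column
renamed <- iris %>% rename_with(~ gsub("Petal", "_", .x))
names(renamed)
```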
I want to rename multiple columns that start with the same string.
However, none of the code I tried changed the column names.
For example this:
df %>% rename_at(vars(matches('^oldname,\\d+$')), ~ str_replace(., 'oldname', 'newname'))
And also this:
df %>% rename_at(vars(starts_with(oldname)), funs(sub(oldname, newname, .))
Do you know of suitable code for this rename?
Thank you!
Taking iris as an example, you can use rename_with() to replace the prefix of the column names starting with "Petal" with a new string.
head(iris) %>%
rename_with(~ sub("^Petal", "New", .x), starts_with("Petal"))
Sepal.Length Sepal.Width New.Length New.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
You can also use rename_at() in this case, although rename_if(), rename_at(), and rename_all() have been superseded by rename_with().
head(iris) %>%
rename_at(vars(starts_with("Petal")), ~ sub("^Petal", "New", .x))
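Applied to column names like those in the question, the same pattern might look as follows. The data frame df and its columns oldname1, oldname2, and other are made up for illustration. Note that starts_with() takes a quoted string.

```r
library(dplyr)

# Hypothetical data frame standing in for the one in the question
df <- data.frame(oldname1 = 1:2, oldname2 = 3:4, other = 5:6)

# Replace the "oldname" prefix only in the columns that start with it
renamed <- df %>%
  rename_with(~ sub("^oldname", "newname", .x), starts_with("oldname"))

names(renamed)
```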
I want to multiply specific columns in a dataset (those that start with "i") by a value (0.045). There is also a column called "id" that has the value 0.045 in all rows.
I've tried this, which did not work:
df %>%
mutate(across(starts_with("i")), ~.id)
The columns to be multiplied can be specified based on position or based on the fact that they all start with "i"
Hope someone can help me.
Thanks a lot!
Magnus
Try this. I used the iris dataset to create the example. Note that the function used to transform the columns must go inside across(), not outside it as in the code you shared. Here is the solution:
library(tidyverse)
#Code
iris %>%
mutate(across(starts_with("Sepal"), ~.*0.045))
Output (some rows):
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 0.2295 0.1575 1.4 0.2 setosa
2 0.2205 0.1350 1.4 0.2 setosa
3 0.2115 0.1440 1.3 0.2 setosa
4 0.2070 0.1395 1.5 0.2 setosa
5 0.2250 0.1620 1.4 0.2 setosa
6 0.2430 0.1755 1.7 0.4 setosa
7 0.2070 0.1530 1.4 0.3 setosa
8 0.2250 0.1530 1.5 0.2 setosa
9 0.1980 0.1305 1.4 0.2 setosa
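For the situation described in the question, where the multiplier sits in the "id" column, a sketch along the following lines may help. The data frame df and its columns i1, i2, and other are invented for illustration. Since "id" itself starts with "i", it is excluded from the selection.

```r
library(dplyr)

# Hypothetical data: "id" holds 0.045 in every row
df <- data.frame(id = 0.045, i1 = c(10, 20), i2 = c(1, 2), other = c(5, 6))

# Multiply every column starting with "i", except id itself, by id
scaled <- df %>%
  mutate(across(starts_with("i") & !id, ~ .x * id))

scaled
```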
Base R solution:
cols_bool <- startsWith(names(iris), "Sepal")
cbind(iris[,!cols_bool, drop = FALSE], iris[,cols_bool, drop = FALSE] * 0.045)
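The cbind() call above moves the multiplied columns to the end. If the original column order matters, one possible variant assigns into the selected columns in place. The name scaled is just an illustrative choice.

```r
# Work on a copy so the built-in iris data stays untouched
scaled <- iris

# Logical index of the columns whose name starts with "Sepal"
cols_bool <- startsWith(names(scaled), "Sepal")

# Overwrite only those columns; all others keep their values and positions
scaled[cols_bool] <- scaled[cols_bool] * 0.045

head(scaled)
```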
I'm looking to use a non-across function from mutate to create multiple columns. My problem is that the variable in the function will change along with the crossed variables. Here's an example:
needs=c('Sepal.Length','Petal.Length')
iris %>% mutate_at(needs, ~./'{col}.Width')
This obviously doesn't work, but I'm looking to divide Sepal.Length by Sepal.Width and Petal.Length by Petal.Width.
I think your needs should be something that is common to both columns of each pair.
You can select the columns based on the pattern in needs and divide them based on position. !! and := are used to assign the names of the new columns.
library(dplyr)
library(rlang)
needs = c('Sepal','Petal')
purrr::map_dfc(needs, ~iris %>%
select(matches(.x)) %>%
transmute(!!paste0(.x, '_divide') := .[[1]]/.[[2]]))
# Sepal_divide Petal_divide
#1 1.457142857 7.000000000
#2 1.633333333 7.000000000
#3 1.468750000 6.500000000
#4 1.483870968 7.500000000
#...
#...
If you want to add these as new columns, you can use bind_cols() to combine the result above with iris.
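A minimal sketch of that last step is shown here. It builds the ratios by column name rather than by position, which assumes each prefix has a ".Length" and a ".Width" column, as is the case in iris. The names ratios and result are illustrative.

```r
library(dplyr)

needs <- c("Sepal", "Petal")

# One ratio vector per prefix: <prefix>.Length / <prefix>.Width
ratios <- lapply(needs, function(p) {
  iris[[paste0(p, ".Length")]] / iris[[paste0(p, ".Width")]]
})
names(ratios) <- paste0(needs, "_divide")

# Append the new columns to the original data
result <- bind_cols(iris, as.data.frame(ratios))
head(result)
```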
Here is a base R approach, which assumes that the columns you want to divide share a similar name pattern:
res <- sapply(split.default(iris[-ncol(iris)], sub('\\..*', '', names(iris[-ncol(iris)]))), function(i) i[1] / i[2])
iris[names(res)] <- res
head(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Petal.Length Sepal.Sepal.Length
#1 5.1 3.5 1.4 0.2 setosa 7.00 1.457143
#2 4.9 3.0 1.4 0.2 setosa 7.00 1.633333
#3 4.7 3.2 1.3 0.2 setosa 6.50 1.468750
#4 4.6 3.1 1.5 0.2 setosa 7.50 1.483871
#5 5.0 3.6 1.4 0.2 setosa 7.00 1.388889
#6 5.4 3.9 1.7 0.4 setosa 4.25 1.384615
I am trying to create new variable in a dataset based on the value of an indicator. The following is the code for the same:
prac_data <- head(iris,10)
COPY_IND='Y' ##declaring the indicator to be 'Y'
prac_data <- prac_data %>% mutate(New_Var=ifelse(COPY_IND=='Y', Sepal.Length, 'N'))
I get the following output:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species New_Var
1 5.1 3.5 1.4 0.2 setosa 5.1
2 4.9 3.0 1.4 0.2 setosa 5.1
3 4.7 3.2 1.3 0.2 setosa 5.1
4 4.6 3.1 1.5 0.2 setosa 5.1
5 5.0 3.6 1.4 0.2 setosa 5.1
6 5.4 3.9 1.7 0.4 setosa 5.1
7 4.6 3.4 1.4 0.3 setosa 5.1
8 5.0 3.4 1.5 0.2 setosa 5.1
9 4.4 2.9 1.4 0.2 setosa 5.1
10 4.9 3.1 1.5 0.1 setosa 5.1
I actually want to copy the variable 'Sepal.Length' in the 'New_Var' for every observation if indicator(COPY_IND) is Yes('Y').
If I do the following, I get the desired response:
if (COPY_IND=='Y')
{
prac_data$New_Var <- prac_data$Sepal.Length
} else {prac_data$New_Var <- 'N'}
I just want to understand why R treats the two 'if-else' approaches differently.
Is there a better, more elegant way to do the same?
Thanks in advance!!
Actually, this might be easier to read as an answer.
From ifelse() help: "ifelse returns a value with the same shape as test which is filled with elements selected from either yes or no depending on whether the element of test is TRUE or FALSE".
Your test is just a single value, so ifelse() returns a single value, either Sepal.Length[1] or N, which is then duplicated across the whole column.
To make your approach work, add rowwise() so that ifelse() is evaluated once per row:
prac_data <- prac_data %>% rowwise() %>% mutate(New_Var = ifelse(COPY_IND=='Y', Sepal.Length, 'N'))
Since COPY_IND is always "Y" in your case, the code could be simplified to prac_data$New_Var = prac_data$Sepal.Length. Even if you want to apply ifelse row-wise, it is better to follow the advice in the help document:
Further note that if(test) yes else no is much more efficient and often much preferable to ifelse(test, yes, no) whenever test is a simple true/false result, i.e., when length(test) == 1.
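To make the difference concrete, here is a short sketch using the objects from the question. It shows that ifelse() with a length-one test returns a single value, and that a plain if/else inside mutate() copies the whole column.

```r
library(dplyr)

prac_data <- head(iris, 10)
COPY_IND <- "Y"

# The test has length 1, so ifelse() returns just one element
single <- ifelse(COPY_IND == "Y", prac_data$Sepal.Length, "N")
length(single)

# if/else picks the whole vector; mutate() recycles "N" if that branch is taken
prac_data <- prac_data %>%
  mutate(New_Var = if (COPY_IND == "Y") Sepal.Length else "N")
```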
I guess the intended COPY_IND is a column of the data frame (or a vector with one element per row) rather than a single fixed value. In that case, your code generates the right answer, e.g. to keep only the first five numbers:
library(dplyr)
prac_data <- head(iris,10)
prac_data$COPY_IND=c(rep('Y',5),rep('N',5))
#COPY_IND=c(rep('Y',5),rep('N',5)) works too
prac_data <- prac_data %>% mutate(New_Var=ifelse(COPY_IND=='Y', Sepal.Length, 'N'))
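This generates the right column: the first five rows copy Sepal.Length and the last five are 'N'. Note that because 'N' is a character, New_Var becomes a character column.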