How to tidily create multiple columns from sets of columns? - r

I'm looking to use a non-across function from mutate to create multiple columns. My problem is that the variable in the function will change along with the crossed variables. Here's an example:
needs=c('Sepal.Length','Petal.Length')
iris %>% mutate_at(needs, ~./'{col}.Width')
This obviously doesn't work, but I'm looking to divide Sepal.Length by Sepal.Width and Petal.Length by Petal.Width.

I think needs should hold what the paired columns have in common, i.e. their shared prefix.
You can then select the columns that match each prefix in needs and divide them by position. !! and := are used to set the name of each new column.
library(dplyr)
library(rlang)
needs = c('Sepal','Petal')
purrr::map_dfc(needs, ~iris %>%
select(matches(.x)) %>%
transmute(!!paste0(.x, '_divide') := .[[1]]/.[[2]]))
# Sepal_divide Petal_divide
#1 1.457142857 7.000000000
#2 1.633333333 7.000000000
#3 1.468750000 6.500000000
#4 1.483870968 7.500000000
#...
#...
If you want to add these as new columns, you can bind_cols the result above with iris.
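As a rough sketch of that last step, assuming dplyr and purrr are installed, the ratios can be built per prefix and then bound to iris. The names ratios and iris_with_ratios are only illustrative, and the lambda is written as a plain function for readability rather than as the answer's exact code.

```r
library(dplyr)

needs <- c("Sepal", "Petal")

# One single-column tibble per prefix, bound column-wise by map_dfc()
ratios <- purrr::map_dfc(needs, function(prefix) {
  cols <- select(iris, matches(prefix))   # e.g. Sepal.Length, Sepal.Width
  tibble(!!paste0(prefix, "_divide") := cols[[1]] / cols[[2]])
})

# Append the ratio columns to the original data (illustrative object name)
iris_with_ratios <- bind_cols(iris, ratios)
head(iris_with_ratios)
```

This relies on each prefix matching exactly two columns, with the numerator first.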

Here is a base R approach, which assumes that the columns you want to divide share a name pattern:
res <- sapply(split.default(iris[-ncol(iris)], sub('\\..*', '', names(iris[-ncol(iris)]))), function(i) i[1] / i[2])
iris[names(res)] <- res
head(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Petal.Length Sepal.Sepal.Length
#1 5.1 3.5 1.4 0.2 setosa 7.00 1.457143
#2 4.9 3.0 1.4 0.2 setosa 7.00 1.633333
#3 4.7 3.2 1.3 0.2 setosa 6.50 1.468750
#4 4.6 3.1 1.5 0.2 setosa 7.50 1.483871
#5 5.0 3.6 1.4 0.2 setosa 7.00 1.388889
#6 5.4 3.9 1.7 0.4 setosa 4.25 1.384615
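To make the grouping step in the one-liner above easier to follow, here is a hedged sketch that does the same thing in stages and gives the new columns simpler names. The object names (num, prefix, groups, ratios, out) and the _ratio suffix are illustrative and not part of the original answer.

```r
num <- iris[-ncol(iris)]                  # drop the Species column
prefix <- sub("\\..*", "", names(num))    # "Sepal" "Sepal" "Petal" "Petal"

# split.default() treats the data frame as a list of columns,
# so this yields one two-column data frame per prefix
groups <- split.default(num, prefix)

# Divide the first column of each group (Length) by the second (Width)
ratios <- lapply(groups, function(g) g[[1]] / g[[2]])
names(ratios) <- paste0(names(ratios), "_ratio")

# Work on a copy instead of overwriting the built-in iris
out <- cbind(iris, as.data.frame(ratios))
head(out)
```

As with the original, this assumes the Length column comes before the Width column within each group.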

Related

Mutate if variable name appears in a list

I would like to use dplyr to divide a subset of variables by their IQR. I am open to ideas that use a different approach than what I've tried before, which is a combination of mutate_if and %in%. I want to reference the columns by the names listed in contin instead of indexing the data frame by position. Thanks for any thoughts!
contin <- c("age", "ct")
data %>%
mutate_if(%in% contin, function(x) x/IQR(x))
You should use:
data %>%
mutate(across(all_of(contin), ~.x/IQR(.x)))
Working example:
data <- head(iris)
contin <- c("Sepal.Length", "Sepal.Width")
data %>%
mutate(across(all_of(contin), ~.x/IQR(.x)))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 15.69231 7.777778 1.4 0.2 setosa
2 15.07692 6.666667 1.4 0.2 setosa
3 14.46154 7.111111 1.3 0.2 setosa
4 14.15385 6.888889 1.5 0.2 setosa
5 15.38462 8.000000 1.4 0.2 setosa
6 16.61538 8.666667 1.7 0.4 setosa
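If the original columns should be kept alongside the scaled ones, across() accepts a .names argument. The sketch below assumes dplyr 1.0 or later; the _iqr suffix and the object name scaled are just illustrative choices.

```r
library(dplyr)

data <- head(iris)
contin <- c("Sepal.Length", "Sepal.Width")

# .names controls the output column names; {.col} is the input column name
scaled <- data %>%
  mutate(across(all_of(contin), ~ .x / IQR(.x), .names = "{.col}_iqr"))

names(scaled)
```

Because all_of() is strict, a name in contin that is missing from the data raises an error instead of being silently skipped.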

Multiply value with specific columns

I want to multiply a value (0.045) with specific columns (that start with "i") in a dataset. There is also a column called "id" that has the value 0.045 in all rows.
I've tried this, which did not work:
df %>%
mutate(across(starts_with("i")), ~.id)
The columns to be multiplied can be specified based on position or based on the fact that they all start with "i"
Hope someone can help me.
Thanks a lot!
Magnus
Try this. I used the iris dataset to create the example. Note that the function that transforms the columns must go inside across(), not outside it as in your code. Here is the solution:
library(tidyverse)
#Code
iris %>%
mutate(across(starts_with("Sepal"), ~.*0.045))
Output (some rows):
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 0.2295 0.1575 1.4 0.2 setosa
2 0.2205 0.1350 1.4 0.2 setosa
3 0.2115 0.1440 1.3 0.2 setosa
4 0.2070 0.1395 1.5 0.2 setosa
5 0.2250 0.1620 1.4 0.2 setosa
6 0.2430 0.1755 1.7 0.4 setosa
7 0.2070 0.1530 1.4 0.3 setosa
8 0.2250 0.1530 1.5 0.2 setosa
9 0.1980 0.1305 1.4 0.2 setosa
Base R solution:
cols_bool <- startsWith(names(iris), "Sepal")
cbind(iris[,!cols_bool, drop = FALSE], iris[,cols_bool, drop = FALSE] * 0.045)
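The question also mentions an id column holding the multiplier. A possible sketch of multiplying by that column rather than by a hard-coded constant is shown below; the data frame df and its columns i1, i2 and other are made up for illustration, and id is excluded from the selection because it also starts with "i".

```r
library(dplyr)

# Hypothetical data resembling the question: id holds 0.045 in every row
df <- data.frame(id = 0.045, i1 = c(1, 2), i2 = c(10, 20), other = c(5, 6))

# Select the "i" columns except id, then multiply each by id
result <- df %>%
  mutate(across(c(starts_with("i"), -id), ~ .x * id))

result
```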

Creating variables from list objects in R

I'm trying to create a binary set of variables that uses data across multiple columns.
I have a dataset where I'm trying to create a binary variable where any column with a specific name will be indexed for a certain value. I'll use iris as an example dataset.
Let's say I want to create a variable where any column with the string "Sepal" and any row in those columns with the values of 5.1, 3.0, and 4.7 will become "Class A" while values with 3.1, 5.0, and 5.4 will be "Class B". So let's look at the first few entries of iris
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
The first 3 rows should then be under "Class A", while rows 4-6 will be under "Class B". I tried writing this code to do that:
mutate(iris, Class = if_else(
vars(contains("Sepal")), any_vars(. %in% c(5.1,3.0, 4.7))), "Class A",
ifelse(vars(contains("Sepal")), any_vars(. %in% c(3.1, 5.0, 5.4))), "Class B",NA)
and received the error
Error: `condition` must be a logical vector, not a `quosures/list` object
So I've realized I need lapply here, but I'm not even sure where to begin to write this because I'm not sure how to represent the entire part of selecting columns with "Sepal" in the name and also include the specific values in those rows as one list object to provide to lapply
This is clearly the wrong syntax
lapply(vars(contains("Sepal")), any_vars(. %in% c(5.1,3.0, 4.7)))
Examples using case_when will also be accepted as answers.
If you want to do this using dplyr, you can use rowwise() with c_across():
library(dplyr)
iris %>%
rowwise() %>%
mutate(Class = case_when(
any(c_across(contains("Sepal")) %in% c(5.1,3.0, 4.7)) ~ 'Class A',
any(c_across(contains("Sepal")) %in% c(3.1,5.0,5.4)) ~ 'Class B')) %>%
head
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Class
# <dbl> <dbl> <dbl> <dbl> <fct> <chr>
#1 5.1 3.5 1.4 0.2 setosa Class A
#2 4.9 3 1.4 0.2 setosa Class A
#3 4.7 3.2 1.3 0.2 setosa Class A
#4 4.6 3.1 1.5 0.2 setosa Class B
#5 5 3.6 1.4 0.2 setosa Class B
#6 5.4 3.9 1.7 0.4 setosa Class B
However, note that using %in% on floating-point values is unreliable: it tests for exact equality, so two numbers that print identically may still fail to match. If interested, you may read "Why are these numbers not equal?".
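A small base R illustration of the floating-point caveat, together with one possible tolerance-based alternative. The helper near_any and its default tolerance are assumptions made for this sketch, not something from dplyr or the answer above.

```r
x <- 0.1 + 0.2

x %in% 0.3                    # FALSE: the stored doubles differ slightly
isTRUE(all.equal(x, 0.3))     # TRUE: comparison within a tolerance

# Hypothetical helper: is each value within tol of any target?
near_any <- function(values, targets, tol = 1e-8) {
  vapply(values, function(v) any(abs(v - targets) < tol), logical(1))
}

near_any(c(x, 5.1, 4.9), c(0.3, 5.1))   # TRUE TRUE FALSE
```

A helper like this could stand in for %in% inside the case_when() conditions if the values come from computations rather than from typed-in literals.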

Smart spreadsheet parsing (managing group sub-header and sum rows, etc)

Say you have a set of spreadsheets formatted like so:
Is there an established method/library to parse this into R without having to individually edit the source spreadsheets? The aim is to parse header rows and dispense with sum rows so the output is the raw data, like so:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 7.0 3.2 4.7 1.4 versicolor
5 6.4 3.2 4.5 1.5 versicolor
6 6.9 3.1 4.9 1.5 versicolor
7 5.7 2.8 4.1 1.3 versicolor
8 6.3 3.3 6.0 2.5 virginica
9 5.8 2.7 5.1 1.9 virginica
10 7.1 3.0 5.9 2.1 virginica
I can certainly hack a tailored solution to this, but I'm wondering whether there is something a bit more developed/elegant than read.csv and a load of logic.
Here's a reproducible demo csv dataset (can't assume an equal number of lines per group..), although I'm hoping the solution can transpose to *.xlsx:
,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
,,,,
Setosa,,,,
1,5.1,3.5,1.4,0.2
2,4.9,3,1.4,0.2
3,4.7,3.2,1.3,0.2
Mean,4.9,3.23,1.37,0.2
,,,,
Versicolor,,,,
1,7,3.2,4.7,1.4
2,6.4,3.2,4.5,1.5
3,6.9,3.1,4.9,1.5
Mean,6.77,3.17,4.7,1.47
,,,,
Virginica,,,,
1,6.3,3.3,6,2.5
2,5.8,2.7,5.1,1.9
3,7.1,3,5.9,2.1
Mean,6.4,3,5.67,2.17
There is a variety of ways to present spreadsheets so it would be hard to have a consistent methodology for all presentations. However, it is possible to transform the data once it is loaded in R. Here's an example with your data. It uses the function na.locf from package zoo.
x <- read.csv(text=",Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
,,,,
Setosa,,,,
1,5.1,3.5,1.4,0.2
2,4.9,3,1.4,0.2
3,4.7,3.2,1.3,0.2
Mean,4.9,3.23,1.37,0.2
,,,,
Versicolor,,,,
1,7,3.2,4.7,1.4
2,6.4,3.2,4.5,1.5
3,6.9,3.1,4.9,1.5
Mean,6.77,3.17,4.7,1.47
,,,,
Virginica,,,,
1,6.3,3.3,6,2.5
2,5.8,2.7,5.1,1.9
3,7.1,3,5.9,2.1
Mean,6.4,3,5.67,2.17", header=TRUE, stringsAsFactors=FALSE)
library(zoo)
x <- x[x$X!="Mean",] #remove Mean line
x$Species <- x$X #create species column
x$Species[grepl("[0-9]",x$Species)] <- NA #put NA if Species contains numbers
x$Species <- na.locf(x$Species) #carry last observation if NA
x <- x[!rowSums(is.na(x))>0,] #remove lines with NA
X Sepal.Length Sepal.Width Petal.Length Petal.Width Species
3 1 5.1 3.5 1.4 0.2 Setosa
4 2 4.9 3.0 1.4 0.2 Setosa
5 3 4.7 3.2 1.3 0.2 Setosa
9 1 7.0 3.2 4.7 1.4 Versicolor
10 2 6.4 3.2 4.5 1.5 Versicolor
11 3 6.9 3.1 4.9 1.5 Versicolor
15 1 6.3 3.3 6.0 2.5 Virginica
16 2 5.8 2.7 5.1 1.9 Virginica
17 3 7.1 3.0 5.9 2.1 Virginica
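For readers who want to see what the carry-forward step does, here is a minimal base R sketch of the same idea without zoo. The helper name locf_base is hypothetical, and unlike na.locf's default behaviour it keeps leading NAs rather than dropping them.

```r
# Hypothetical helper: replace each NA with the last non-NA value before it
locf_base <- function(v) {
  seen <- !is.na(v)
  idx <- cumsum(seen)     # index of the most recent non-NA value
  idx[idx == 0] <- NA     # nothing to carry forward before the first value
  v[seen][idx]
}

labels <- c(NA, "Setosa", NA, NA, "Versicolor", NA)
locf_base(labels)
# NA "Setosa" "Setosa" "Setosa" "Versicolor" "Versicolor"
```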
I just recently did something similar. Here was my solution:
iris <- read.csv(text=",Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
,,,,
Setosa,,,,
1,5.1,3.5,1.4,0.2
2,4.9,3,1.4,0.2
3,4.7,3.2,1.3,0.2
Mean,4.9,3.23,1.37,0.2
,,,,
Versicolor,,,,
1,7,3.2,4.7,1.4
2,6.4,3.2,4.5,1.5
3,6.9,3.1,4.9,1.5
Mean,6.77,3.17,4.7,1.47
,,,,
Virginica,,,,
1,6.3,3.3,6,2.5
2,5.8,2.7,5.1,1.9
3,7.1,3,5.9,2.1
Mean,6.4,3,5.67,2.17", header=TRUE, stringsAsFactors=FALSE)
First I used a helper function that splits the data at a given set of row indices.
split_at <- function(x, index) {
N <- NROW(x)
s <- cumsum(seq_len(N) %in% index)
unname(split(x, s))
}
Then you define that index using:
iris[,1] <- stringr::str_trim(iris[,1])
index <- which(iris[,1] %in% c("Virginica", "Versicolor", "Setosa"))
The rest is just using purrr::map_df to perform actions on each data.frame in the list that's returned. You can add some additional flexibility for removing unwanted rows if needed.
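library(magrittr) # provides %>%
split_at(iris, index) %>%
.[2:length(.)] %>% # drop the leading chunk that precedes the first species header
purrr::map_df(function(x) {
Species <- x[1,1] # the first row of each chunk holds the species name
# keep only rows whose first column is a row number; this drops the header, "Mean" and blank rows
x <- x[grepl("^[0-9]+$", x[,1]),]
data.frame(x, Species = Species)
})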
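To see what the split_at helper does on its own, it may help to try it on a small vector. This is only an illustration; the helper is repeated here so the snippet is self-contained, and the input vector and the name chunks are made up.

```r
split_at <- function(x, index) {
  N <- NROW(x)
  # cumsum() increases by one at every index position,
  # so each index starts a new chunk
  s <- cumsum(seq_len(N) %in% index)
  unname(split(x, s))
}

chunks <- split_at(letters[1:6], c(3, 5))
chunks
# a list of three chunks: "a" "b", then "c" "d", then "e" "f"
```

Anything before the first index ends up in its own leading chunk, which is why the answer above drops the first list element.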

Sum column every n column in a data frame R

I have a data frame (A) with 10 columns and 300 rows. I need to sum each pair of adjacent columns, in this way:
A[,1]+A[,2] = # first result
A[,3]+A[,4] = # second result
A[,5]+A[,6]= # third result
....
A[,9]+A[,10] # last result
The expected final result is a new data frame with 5 columns and 300 rows.
Is there any way to do this, with tapply or a for loop?
I know that I can do it as in the example above, but I'm looking for a fast method.
Thank you
We could use sapply:
df <- data.frame(replicate(expr=rnorm(100),n = 10))
sapply(seq(1,9,by=2),function(i) rowSums(df[,i:(i+1)]))
You can do it without *apply loops.
Sample data:
df <- head(iris[-5])
df
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#1 5.1 3.5 1.4 0.2
#2 4.9 3.0 1.4 0.2
#3 4.7 3.2 1.3 0.2
#4 4.6 3.1 1.5 0.2
#5 5.0 3.6 1.4 0.2
#6 5.4 3.9 1.7 0.4
Now you can use recycling of logical index vectors:
df[c(TRUE,FALSE)] + df[c(FALSE,TRUE)]
# Sepal.Length Petal.Length
#1 8.6 1.6
#2 7.9 1.6
#3 7.9 1.5
#4 7.7 1.7
#5 8.6 1.6
#6 9.3 2.1
It's a bit cryptic, but I think it should be fast. We add each column to the adjacent column, then drop the unnecessary results with c(T,F), which is recycled to keep only the odd-numbered columns:
(A[1:(ncol(A)-1)] + A[2:ncol(A)])[c(T,F)]
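The answers above all handle pairs of columns. A possible generalisation to groups of n adjacent columns is sketched below in base R; the function name sum_every_n and the sum_ prefix for the output columns are assumptions made for this example.

```r
# Hypothetical helper: sum each run of n adjacent columns row-wise
sum_every_n <- function(df, n) {
  grp <- ceiling(seq_along(df) / n)   # 1,1,...,2,2,... one id per column
  sums <- lapply(split.default(df, grp), function(g) unname(rowSums(g)))
  names(sums) <- paste0("sum_", names(sums))
  as.data.frame(sums)
}

df <- head(iris[-5])
res <- sum_every_n(df, 2)
res
```

If the number of columns is not a multiple of n, the last group simply contains fewer columns.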
