Mutate several features simultaneously with a custom function - r

I've found several SO posts on this already but cannot see how to apply to my specific problem.
I have a dataframe with a number of features that I would like to simultaneously mutate. I want to write over them rather than create new features.
E.g. using mtcars: suppose I want to set am, gear and carb to 1 if greater than 0 and 0 otherwise, for each of those 3 features. How could I do that?
mtcars %>% mutate_at(vs:carb, funs(???))
I want to apply a custom function of this form ifelse(x > 0, 1, 0) where x is either of the 3 features being worked on.
How can I achieve this?

You need to use vars() for vs:carb to parse, and you use . as a stand-in for the argument in funs:
mtcars %>% mutate_at(vars(vs:carb), funs(ifelse(. > 0, 1, 0)))
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 1 1
# 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 1 1
# 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 1 1
# 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 1 1
# 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 1 1
# 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 1 1
# ...
This is explained in the ?funs help page:
A list of functions specified by:
- Their name, "mean"
- The function itself, mean
- A call to the function with . as a dummy argument, mean(., na.rm = TRUE)
With this corresponding to the third bullet.
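For readers on a current dplyr: funs() has been deprecated since dplyr 0.8.0, and mutate_at() is superseded by across() as of dplyr 1.0.0. A minimal sketch of the same idea, assuming dplyr >= 1.0.0 and, purely as an illustration, only the three columns named in the question (am, gear, carb) rather than vs:carb:

```r
library(dplyr)

# Overwrite am, gear and carb in place: 1 if the value is > 0, else 0.
# .x plays the role that . played inside funs().
result <- mtcars %>%
  mutate(across(c(am, gear, carb), ~ ifelse(.x > 0, 1, 0)))

head(result)
```

The columns are overwritten rather than duplicated because no .names argument is given to across().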

Related

Can you use "starts_with" as shorthand within a simple "as.numeric" function to query multiple columns?

I have a dataframe with multiple columns of a numeric type, where I want to query if a range of values exist in any of them, and bring back a true/false binary flag with as.numeric.
So I can do this the long way with:
df <- df %>%
  mutate(flag = as.numeric(days_dry %in% c(1:28) |
                           days_frozen %in% c(1:28) |
                           days_fresh %in% c(1:28)))
But I have a bunch of columns I want to query. Why can't I bring back the same result with this?:
df <- df %>%
mutate(flag = as.numeric(vars(starts_with("days_")) %in% c(1:28)))
I get no error, but it doesn't bring back any cases which match the criteria.
There might be a better way, but ...
mtcars %>%
mutate(flag = rowSums(sapply(cbind(select(., starts_with("c"))), `%in%`, 4:6)) > 0) %>%
head()
# mpg cyl disp hp drat wt qsec vs am gear carb flag
# Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 TRUE
# Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 TRUE
# Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 TRUE
# Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 TRUE
# Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 FALSE
# Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 TRUE
The premise is using cbind(select(., <>)) to form a mid-pipe inner frame. From there, we sapply over its columns, converting them to columns of logicals. The last step is using rowSums(.) > 0 to determine if a row has at least one TRUE; an alternative to rowSums can use Reduce(`|`, ...), but while that is elegant in a list-processing kind of way, it is also slower (especially with multiple matching columns).
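To make the comparison concrete, here is a small base-R sketch (no dplyr needed) that builds the same flag both ways, once with rowSums and once with Reduce. The names cols and hits are only illustrative.

```r
# Columns whose names start with "c" (cyl and carb in mtcars)
cols <- grep("^c", names(mtcars), value = TRUE)

# One logical vector per column: is the value one of 4, 5, 6?
hits <- lapply(mtcars[cols], `%in%`, 4:6)

# Variant 1: bind into a logical matrix and count TRUEs per row
flag_rowsums <- rowSums(do.call(cbind, hits)) > 0

# Variant 2: fold the list with element-wise OR
flag_reduce <- Reduce(`|`, hits)

head(data.frame(mtcars[cols], flag_rowsums, flag_reduce))
```

Both variants should agree row by row; which one is faster on a given dataset is something to measure rather than assume.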

"Total score" based on multiple "above/below median values"

I am new to R and I have a problem that I simply can't find the solution to. I have tried reading through older posts etc. but I can't figure out how it could/should be done. I hope that some of you might be able to help.
Using the mtcars dataset as an example,
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
I wish to assign the cars a value (e.g. 1) if, for instance, the hp value is above the median and 0 if below. This could be called "hp_index".
I would do the same for, let's say, cyl, disp and drat. Then I would like to assign the cars an "index_score", where a car would for instance be given the value 1 if at least 3 of the 4 variables hp, cyl, disp and drat are above the median (that is, if 3 or 4 of "hp_index", "cyl_index", "disp_index" and "drat_index" are 1).
Once again, I really hope that some of you might be able to help!
Thanks in advance, and have a nice day!
here's a tidyverse solution:
library(tidyverse)
mtcars %>%
mutate(across(c(hp, cyl, disp, drat), .fns = ~if_else(.x >= median(.x), 1, 0), .names = "{.col}_index"),
index_score = apply(across(c(hp_index, cyl_index, disp_index, drat_index)), 1, sum),
index_score = if_else(index_score >= 3, 1, 0))
Note: I split the conditions into >= median and < median. If you use only > and <, a value exactly equal to the median satisfies neither condition, so such rows could end up unclassified (missing).
First six cases:
mpg cyl disp hp drat wt qsec vs am gear carb hp_index cyl_index disp_index drat_index index_score
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 0 1 0 1 0
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 0 1 0 1 0
3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 0 0 0 1 0
4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 0 1 1 0 0
5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 1 1 1 0 1
6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 0 1 1 0 0
Using dplyr :
library(dplyr)
cols <- c('cyl', 'disp', 'drat', 'hp')
mtcars %>%
mutate(across(all_of(cols), list(index = ~+(.x > median(.x))))) %>%
mutate(index_score = +(rowSums(select(., ends_with('index'))) >= 3)) -> result
result
Used select(., ends_with('index')) in rowSums assuming there are no other columns that end with index in your actual dataset apart from the newly created ones with across.
This version is based on all the columns; specific columns can be selected instead:
mtcars$index=rowSums(
sapply(mtcars,function(x){
ifelse(x>median(x),1,0)
})
)
mpg cyl disp hp drat wt qsec vs am gear carb index
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 5
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 4
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 4
...
From here you can apply another ifelse to the "index" column to decide whether the final score should be 0 or 1.
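As a sketch of that last step, limited to the four columns from the question (hp, cyl, disp, drat) and using a strict > median comparison, which is an assumption about how ties should be handled:

```r
cols <- c("hp", "cyl", "disp", "drat")

# 1 where the value is strictly above that column's median, else 0
above <- sapply(mtcars[cols], function(x) ifelse(x > median(x), 1, 0))

scored <- mtcars
scored$index <- rowSums(above)                         # how many of the 4 are above median
scored$index_score <- ifelse(scored$index >= 3, 1, 0)  # 1 if at least 3 of 4

head(scored[c(cols, "index", "index_score")])
```

Working on a copy (scored) leaves mtcars itself untouched, so the sapply over all columns shown earlier is not affected by the new columns.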
To use different filters for columns, as you proposed
rowSums(
cbind(
sapply(mtcars[c("cyl", "disp", "drat")],function(x){ifelse(x>median(x),1,0)}),
sapply(mtcars["hp"],function(x){ifelse(x<median(x),1,0)})
)
)

Assign new variable created with mutate_ to a name that will be passed as a string in dplyr

I want to use strings with dplyr expressions and in particular I want to pass expressions as strings to mutate to create new variables and assign names to these variables that will be passed also as strings.
My code at this point is the following:
library(dplyr)
library(tidyverse)
data(mtcars)
mutate_expr = "gear * carb"
mtcars %>% mutate_(mutate_expr)
The new variable is named here 'gear*carb'. How could I give it the name 'gear_carb', passing the name to the dplyr expression as a string?
You now do this with tidyeval:
library(dplyr)
mutate_expr <- quo(gear * carb)
mtcars %>% mutate(new_col = !!mutate_expr) %>% head()
#> mpg cyl disp hp drat wt qsec vs am gear carb new_col
#> 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 16
#> 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 16
#> 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4
#> 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 3
#> 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 6
#> 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 3
If you must store the expression as a string, you can use rlang::parse_expr instead of quo (sym is not enough in this context, since it only builds a single symbol such as a column name), but storing code as a character string is a bad idea.
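If both the expression and the new column name really do arrive as strings, a hedged sketch using rlang::parse_expr() together with the := operator could look like this. The variable new_name is introduced here for illustration, and the sketch assumes dplyr and rlang are installed.

```r
library(dplyr)
library(rlang)

mutate_expr <- "gear * carb"   # expression stored as a string
new_name    <- "gear_carb"     # desired column name, also a string

# parse_expr() turns the string into an expression and !! injects it;
# := allows the left-hand side to be unquoted as well.
result <- mtcars %>%
  mutate(!!new_name := !!parse_expr(mutate_expr))

head(result)
```

Parsing arbitrary strings as code is only reasonable when the strings come from a trusted source.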

Add new data.frame column based on values in other columns

I'm trying to iterate over a data table to calculate the integral of two columns, a dt$xmin and a dt$xmax, with a function f, having the answer be written to a new column dt$integral. I'm currently trying to use something like the below code without success:
dt$integral <- mapply(f, dt$xmin, dt$xmax)
Any help would be greatly appreciated!
Perhaps you do not need mapply, and a simple assignment will work, provided f is vectorized over its arguments: dt$integral <- f(dt$xmin, dt$xmax). There is no data or example of what you want, but here's what I think could work for you (using data.table):
library(data.table)
dt <- as.data.table(mtcars)
newfunc <- function(a, b){
return(a + log(b) - exp(a/b) + 3.1*a^1.918)
}
# Adding a new column called "newcol"
head(dt[, newcol := newfunc(wt, mpg/qsec)])
mpg cyl disp hp drat wt qsec vs am gear carb newcol
1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 14.73145
2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 16.30387
3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 11.45233
4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 13.87593
5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 13.78816
6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 -10.84599
For a single new column the above style of variable assignment would work. For multiple new columns and functions, you would need to use a function that returns a list for the new columns. Look up more on assignment using := in data.table.
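If, on the other hand, f is the integrand and xmin/xmax are the limits (an assumption, since the question does not say), then a plain vectorized call is not enough, because stats::integrate() handles one pair of limits at a time. In that case mapply is a reasonable fit. A small sketch with a made-up integrand and made-up data:

```r
# Hypothetical integrand; replace with the real f
f <- function(x) x^2

# Made-up limits, one integral per row
dt <- data.frame(xmin = c(0, 1, 2), xmax = c(1, 2, 3))

# integrate() returns a list; $value holds the numeric estimate
dt$integral <- mapply(
  function(lo, hi) integrate(f, lower = lo, upper = hi)$value,
  dt$xmin, dt$xmax
)

dt
```

The same mapply call works on a data.table column as well, since it only needs two vectors of equal length.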

Move a column to first position in a data frame

I would like to have the last column of the data frame moved to the start (as first column). How can I do it in R?
My data.frame has about a thousand columns, so changing the order by hand won't do. I just want to pick one column and "move it to the start".
Dplyr's select() approach
Moving the last column to the start:
new_df <- df %>%
select(last_column_name, everything())
This is also valid for any column and any quantity:
new_df <- df %>%
select(col_5, col_8, everything())
Example using mtcars data frame:
head(mtcars, n = 2)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# Last column is 'carb'
new_df <- mtcars %>% select(carb, everything())
head(new_df, n = 2)
# carb mpg cyl disp hp drat wt qsec vs am gear
# Mazda RX4 4 21.0 6 160 110 3.90 2.620 16.46 0 1 4
# Mazda RX4 Wag 4 21.0 6 160 110 3.90 2.875 17.02 0 1 4
dplyr 1.0.0 now includes the relocate() function to reorder columns. The default behaviour is to move the named column(s) to the first position.
library(dplyr) # from version 1.0.0
mtcars %>%
relocate(carb) %>%
head()
carb mpg cyl disp hp drat wt qsec vs am gear
Mazda RX4 4 21.0 6 160 110 3.90 2.620 16.46 0 1 4
Mazda RX4 Wag 4 21.0 6 160 110 3.90 2.875 17.02 0 1 4
Datsun 710 1 22.8 4 108 93 3.85 2.320 18.61 1 1 4
Hornet 4 Drive 1 21.4 6 258 110 3.08 3.215 19.44 1 0 3
Hornet Sportabout 2 18.7 8 360 175 3.15 3.440 17.02 0 0 3
Valiant 1 18.1 6 225 105 2.76 3.460 20.22 1 0 3
But other locations can be specified with the .before or .after arguments:
mtcars %>%
relocate(gear, carb, .before = cyl) %>%
head()
mpg gear carb cyl disp hp drat wt qsec vs am
Mazda RX4 21.0 4 4 6 160 110 3.90 2.620 16.46 0 1
Mazda RX4 Wag 21.0 4 4 6 160 110 3.90 2.875 17.02 0 1
Datsun 710 22.8 4 1 4 108 93 3.85 2.320 18.61 1 1
Hornet 4 Drive 21.4 3 1 6 258 110 3.08 3.215 19.44 1 0
Hornet Sportabout 18.7 3 2 8 360 175 3.15 3.440 17.02 0 0
Valiant 18.1 3 1 6 225 105 2.76 3.460 20.22 1 0
You can change the order of columns by addressing them in the new order, choosing them explicitly with data[,c(ORDER YOU WANT THEM TO BE IN)]
If you just want the last column to be first use: data[,c(ncol(data),1:(ncol(data)-1))]
> head(cars)
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
> head(cars[,c(2,1)])
dist speed
1 2 4
2 10 4
3 4 7
4 22 7
5 16 8
6 10 9
dataframe<-dataframe[,c(1000, 1:999)]
This will move your last column, i.e. the 1000th column, to the first position.
I don't know if it's worth adding this as an answer or if a comment would be fine, but I wrote a function called moveme that lets you do what you want to do with the language you describe. You can find the function at this answer: https://stackoverflow.com/a/18540144/1270695
It works on the names of your data.frame and produces a character vector that you can use to reorder your columns:
mydf <- data.frame(matrix(1:12, ncol = 4))
mydf
moveme(names(mydf), "X4 first")
# [1] "X4" "X1" "X2" "X3"
moveme(names(mydf), "X4 first; X1 last")
# [1] "X4" "X2" "X3" "X1"
mydf[moveme(names(mydf), "X4 first")]
# X4 X1 X2 X3
# 1 10 1 4 7
# 2 11 2 5 8
# 3 12 3 6 9
If you're shuffling things around like this, I suggest converting your data.frame to a data.table and using setcolorder (with my moveme function, if you wish) to make the change by reference.
In your question, you also mentioned "I just want to pick one column and move it to the start". If it's an arbitrary column, and not specifically the last one, you could also look at using setdiff.
Imagine you're working with the "mtcars" dataset and want to move the "am" column to the start.
x <- "am"
mtcars[c(x, setdiff(names(mtcars), x))]
If you want to move any named column to the first position, simply use:
df[,c(which(colnames(df)=="desired_colname"),which(colnames(df)!="desired_colname"))]
A native R approach that works with any number of rows or columns to move the last column of a dataframe to the first column position:
df <- df[, c(ncol(df), seq_len(ncol(df) - 1))]
It can be used to move any column to the first position by putting its number first and dropping it from the remaining indices:
df <- df[, c(your_column_number_here, setdiff(seq_len(ncol(df)), your_column_number_here))]
If you don't know the column number, but know the column label name, do the following, replacing "your_column_name_here":
columnNumber <- which(colnames(df) == "your_column_name_here")
df <- df[, c(columnNumber, setdiff(seq_len(ncol(df)), columnNumber))]
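One thing to watch with index arithmetic like this: in R, : binds tighter than binary minus, so 1:ncol(df)-1 is (1:ncol(df)) - 1, i.e. 0 up to ncol(df) - 1, not 1 up to ncol(df) - 1. A sketch of a small helper that sidesteps the arithmetic by working on names; move_to_front is a hypothetical name, not a function from any package:

```r
# Precedence check: the range is built first, then 1 is subtracted
n <- 5
print(1:n - 1)     # same as (1:n) - 1
print(1:(n - 1))   # what is usually intended

# Hypothetical helper: move one named column to the first position
move_to_front <- function(df, col) {
  stopifnot(col %in% names(df))
  df[, c(col, setdiff(names(df), col)), drop = FALSE]
}

head(move_to_front(mtcars, "am"), n = 2)
```

Using drop = FALSE keeps the result a data frame even when the input has a single column.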
There is also the data.table option with setcolorder():
library(data.table)
mtcars_copy <- copy(mtcars)
setDT(mtcars_copy)
# Move column "gear" in the first position
setcolorder(mtcars_copy, neworder = "gear")
head(mtcars_copy)
# gear mpg cyl disp hp drat wt qsec vs am carb
# 1: 4 21.0 6 160 110 3.90 2.620 16.46 0 1 4
# 2: 4 21.0 6 160 110 3.90 2.875 17.02 0 1 4
# 3: 4 22.8 4 108 93 3.85 2.320 18.61 1 1 1
# 4: 3 21.4 6 258 110 3.08 3.215 19.44 1 0 1
# 5: 3 18.7 8 360 175 3.15 3.440 17.02 0 0 2
# 6: 3 18.1 6 225 105 2.76 3.460 20.22 1 0 1
If multiple columns, then mention the order in a vector:
setcolorder(mtcars_copy, neworder = c("vs", "carb"))
head(mtcars_copy)
# vs carb gear mpg cyl disp hp drat wt qsec am
# 1: 0 4 4 21.0 6 160 110 3.90 2.620 16.46 1
# 2: 0 4 4 21.0 6 160 110 3.90 2.875 17.02 1
# 3: 1 1 4 22.8 4 108 93 3.85 2.320 18.61 1
# 4: 1 1 3 21.4 6 258 110 3.08 3.215 19.44 0
# 5: 0 2 3 18.7 8 360 175 3.15 3.440 17.02 0
# 6: 1 1 3 18.1 6 225 105 2.76 3.460 20.22 0
Move any column from any position to the first position in your data:
n <- which(colnames(df) == "column_need_move")
column_need_move <- df$column_need_move
df <- cbind(column_need_move, df[, -n, drop = FALSE])
If you want to create a new column and have it be the first column, use the .before=1 argument:
my_data <- my_data %>% mutate(newcol = a*b, .before=1)

Resources