I am trying to write a custom function that carries out an operation on a single column of a dataframe, but returns the entire dataframe, not just the column that was operated on. Here is a simple example of what I want to achieve:
# libraries needed
library(dplyr)
library(rlang)
# creating a dataframe
data <- as.data.frame(cbind(
  x = rnorm(5),
  y = rnorm(5),
  z = rnorm(5)
))
# defining the custom function
custom.fn <- function(data, x) {
  # how can I also retain all the other columns in the dataframe apart from x?
  df <- dplyr::select(.data = data, x = !!rlang::enquo(x))
  df$x <- df$x / 2
  return(df)
}
# calling the function (I also want y and z in the output)
custom.fn(data = data, x = x)
#> x
#> 1 0.49917536
#> 2 -0.03373202
#> 3 -1.24845349
#> 4 -0.15809688
#> 5 0.11237030
Created on 2018-02-14 by the reprex package (v0.1.1.9000).
Just specify the columns you want to include in your select call:
custom.fn <- function(data, x) {
  df <- dplyr::select(.data = data, x = !!rlang::enquo(x), y, z)
  df$x <- df$x / 2
  return(df)
}
If you don't want to name the rest of the columns explicitly, you can also use everything():
df <- dplyr::select(.data = data, x = !!rlang::enquo(x), dplyr::everything())
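On newer versions of rlang (0.4.0 and later), the enquo/!! pair can also be written with the "embrace" shorthand {{ }}. A minimal sketch of the same function, assuming a recent dplyr/rlang:

```r
library(dplyr)

# Same idea with the {{ }} shorthand: rename the chosen column to x,
# keep all remaining columns via everything(), then halve x.
custom.fn <- function(data, x) {
  df <- dplyr::select(data, x = {{ x }}, dplyr::everything())
  df$x <- df$x / 2
  df
}

d <- data.frame(x = c(2, 4), y = c(1, 1), z = c(3, 3))
custom.fn(d, x)
```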
Is there a way of getting my data table to look like my target table when using dtplyr and mutate?
A dummy table:
library(data.table)
library(dtplyr)
library(dplyr)
id <- rep(c("A","B"),each=3)
x1 <- rnorm(6)
x2 <- rnorm(6)
dat <- data.table(id,x1,x2)
A dummy function:
my_fun <- function(x, y) {
  cbind(a = x + 10, b = y - 10)
}
And I would like to use this type of syntax
dat |>
group_by(id) |>
mutate(my_fun(x = x1,y = x2))
Where the end result will look like this
data.table(id, x1, x2, a=x1+10,b=x2-10)
I would like a generic solution that works for functions that return a variable number of columns, but is that possible?
I think we would need more information about how this would work with a variable number of columns:
Are the columns named in a specific way?
Do the output columns need to be named in a specific way?
Are there standard calculations being done to each column dependent on name? E.g., x1 = +10 and x2 = -10?
At any rate, here is a solution that works with your provided data to return the data.table you specified:
my_fun <- function(data, ...) {
  dots <- list(...)  # column names passed as strings
  cbind(data,
        a = data[[dots$x]] + 10,
        b = data[[dots$y]] - 10)
}
dat |>
my_fun(x = "x1", y = "x2")
id x1 x2 a b
1: A 0.8485309 -0.3532837 10.848531 -10.353284
2: A 0.7248478 -1.6561564 10.724848 -11.656156
3: A -1.3629114 0.4210139 8.637089 -9.578986
4: B -1.7934827 0.6717033 8.206517 -9.328297
5: B -1.0971890 -0.3008422 8.902811 -10.300842
6: B 0.4396630 -0.7447419 10.439663 -10.744742
I have a dataframe, df, with several columns in it. I would like to write a function that creates new columns dynamically from existing column names, in part by using the last four characters of an existing name. For example, I would like to create a variable named df$rev_2002 like so:
df$rev_2002 <- df$avg_2002 * df$quantity
The problem is I would like to be able to run the function every time a new column (say, df$avg_2003) is appended to the dataframe.
To this end, I used the following function to extract the last 4 characters of the df$avg_2002 variable:
substRight <- function(x, n) {
  substr(x, nchar(x) - n + 1, nchar(x))
}
I tried putting together another function to create the columns:
revved <- function(x, y, z) {
  z = x * y
  names(z) <- paste('revenue', substRight(x, 4), sep = "_")
  return(x)
}
But when I try it on actual data I don't get new columns in my df. The desired result is a series of variables in my df such as:
df$rev_2002, df$rev_2003...df$rev_2020 or whatever is the largest value of the last four characters of the x variable (df$avg_2002 in example above).
Any help or advice would be truly appreciated. I'm really in the woods here.
dat <- data.frame(id = 1:2, quantity = 3:4, avg_2002 = 5:6, avg_2003 = 7:8, avg_2020 = 9:10)
func <- function(dat, overwrite = FALSE) {
  nms <- grep("avg_[0-9]+$", names(dat), value = TRUE)
  revnms <- gsub("avg_", "rev_", nms)
  if (!overwrite) revnms <- setdiff(revnms, names(dat))
  dat[, revnms] <- lapply(dat[, nms], `*`, dat$quantity)
  dat
}
func(dat)
# id quantity avg_2002 avg_2003 avg_2020 rev_2002 rev_2003 rev_2020
# 1 1 3 5 7 9 15 21 27
# 2 2 4 6 8 10 24 32 40
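If dplyr is an option, the same idea can be sketched with across() (available in dplyr 1.0 and later): the glue specification in .names rewrites the avg_ prefix to rev_ for each new column.

```r
library(dplyr)

dat <- data.frame(id = 1:2, quantity = 3:4,
                  avg_2002 = 5:6, avg_2003 = 7:8, avg_2020 = 9:10)

# across() multiplies every avg_* column by quantity; .names builds
# the matching rev_* name from each source column name.
out <- mutate(dat, across(starts_with("avg_"),
                          ~ .x * quantity,
                          .names = "{sub('avg_', 'rev_', .col)}"))
out
```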
dplyr's rename function requires the new column name to be passed in as an unquoted variable name. However, I have a function where the column name is constructed by pasting a string onto an argument, so it is a character string.
For example say I had this function
myFunc <- function(df, col) {
  new <- paste0(col, '_1')
  out <- dplyr::rename(df, new = old)
  return(out)
}
If I run this
df <- data.frame(a = 1:3, old = 4:6)
myFunc(df, 'x')
I get
a new
1 1 4
2 2 5
3 3 6
Whereas I want the 'new' column to be the name of the string I constructed ('x_1'), i.e.
a x_1
1 1 4
2 2 5
3 3 6
Is there anyway of doing this?
I think this is what you were looking for. It uses rename_, as #Henrik suggested, but the argument has a, let's say, interesting name:
> myFunc <- function(df, col){
+ new <- paste0(col, '_1')
+ out <- dplyr::rename_(df, .dots=setNames(list(col), new))
+ return(out)
+ }
> myFunc(data.frame(x=c(1,2,3)), "x")
x_1
1 1
2 2
3 3
>
Note the use of setNames to use the value of new as name in the list.
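Note that rename_ is deprecated in newer dplyr releases. A sketch of the same function in current tidy-evaluation style, assuming dplyr 0.7+ with rlang: build the name as a string, then splice it in with !! and the := operator.

```r
library(dplyr)
library(rlang)

# Splice the computed name in with !! and :=; 'old' is the existing
# column name from the question's example.
myFunc <- function(df, col) {
  new <- paste0(col, "_1")
  dplyr::rename(df, !!new := old)
}

myFunc(data.frame(a = 1:3, old = 4:6), "x")
```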
Recent updates to dplyr allow you to use the rename_with function.
Say you have a data frame:
library(tidyverse)
df <- tibble(V0 = runif(10), V1 = runif(10), V2 = runif(10), key=letters[1:10])
And you want to change all of the "V" columns. Usually, my reference for columns like this comes from a JSON file, which in R becomes a named list or vector, e.g.:
colmapping <- c("newcol1", "newcol2", "newcol3")
names(colmapping) <- paste0("V",0:2)
You can then use the following to change the names of df to the strings in the colmapping list:
df <- rename_with(.data = df, .cols = starts_with("V"), .fn = function(x){colmapping[x]})
I am trying to split a data table by a column; however, the resulting list of data tables still contains the column the table was split by. How can I drop this column once the split is complete? Or, preferably, is there a way to drop multiple columns?
This is my code:
x <- rnorm(10, mean = 5, sd = 2)
y <- rnorm(10, mean = 5, sd = 2)
z <- sample(5, 10, replace = TRUE)
dt <- data.table(x, y, z)
split(dt, dt$z)
The resulting data table subsets look like this:
$`1`
x y z
1: 6.179790 5.776683 1
2: 5.725441 4.896294 1
3: 8.690388 5.394973 1
$`2`
x y z
1: 5.768285 3.951733 2
2: 4.572454 5.487236 2
$`3`
x y z
1: 5.183101 8.328322 3
2: 2.830511 3.526044 3
$`4`
x y z
1: 5.043010 5.566391 4
2: 5.744546 2.780889 4
$`5`
x y z
1: 6.771102 0.09301977 5
Thanks
Splitting a data.table is really not worthwhile unless you have some fancy parallelization step to follow. And even then, you might be better off sticking with a single table.
That said, I think you want
split( dt[, !"z"], dt$z )
# or more generally
mysplitDT <- function(x, bycols)
split( x[, !..bycols], x[, ..bycols] )
mysplitDT(dt, "z")
You would run into the same problem if you had a data.frame:
df = data.frame(dt)
split( df[-which(names(df)=="z")], df$z )
First thing that came to mind was to iterate through the list and drop the z column.
lapply(split(dt, dt$z), function(d) { d$z <- NULL; d })
And I just noticed that you use the data.table package, so there is probably a better, data.table way of achieving your desired result.
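One such data.table way: recent versions of data.table ship a split.data.table method whose keep.by argument drops the grouping column(s) directly. A small sketch with deterministic data:

```r
library(data.table)

# by = names the split column(s); keep.by = FALSE drops them from
# each resulting subset.
dt <- data.table(x = 1:4, y = 5:8, z = c(1, 1, 2, 2))
parts <- split(dt, by = "z", keep.by = FALSE)
parts
```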
I'm trying to identify the values in a data frame that do not match, but can't figure out how to do this.
# make data frame
a <- data.frame( x = c(1,2,3,4))
b <- data.frame( y = c(1,2,3,4,5,6))
# select only values from b that are not in 'a'
# attempt 1:
results1 <- b$y[ !a$x ]
# attempt 2:
results2 <- b[b$y != a$x,]
If a = c(1,2,3) this works, because the length of a divides the length of b and the comparison recycles. However, I just want to select all the values of b$y that are not in a$x, and I don't understand what function to use.
If I understand correctly, you need the negation of the %in% operator. Something like this should work:
subset(b, !(y %in% a$x))
> subset(b, !(y %in% a$x))
y
5 5
6 6
Try the set difference function setdiff. So you would have
results1 = setdiff(a$x, b$y) # elements in a$x NOT in b$y
results2 = setdiff(b$y, a$x) # elements in b$y NOT in a$x
You could also use dplyr for this task. To find what is in b but not a:
library(dplyr)
anti_join(b, a, by = c("y" = "x"))
# y
# 1 5
# 2 6
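The same negated %in% test also works inside filter(), which reads much like the subset() answer but composes with other dplyr verbs:

```r
library(dplyr)

a <- data.frame(x = c(1, 2, 3, 4))
b <- data.frame(y = c(1, 2, 3, 4, 5, 6))

# keep rows of b whose y value does not appear in a$x
filter(b, !y %in% a$x)
```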