Assign multiple columns when using mutate in dtplyr - r

Is there a way of getting my data table to look like my target table when using dtplyr and mutate?
A Dummy table
library(data.table)
library(dtplyr)
library(dplyr)
id <- rep(c("A","B"),each=3)
x1 <- rnorm(6)
x2 <- rnorm(6)
dat <- data.table(id,x1,x2)
A dummy function
my_fun <- function(x,y){
cbind(a = x+10,b=y-10)
}
And I would like to use this type of syntax
dat |>
group_by(id) |>
mutate(my_fun(x = x1,y = x2))
Where the end result will look like this
data.table(id, x1, x2, a=x1+10,b=x2-10)
I would like to have a generic solution that works for functions with variable number of columns returned but is that possible?

I think we would need more information about how this would work with a variable number of columns:
Are the columns named in a specific way?
Do the output columns need to be named in a specific way?
Are there standard calculations being done to each column dependent on name? E.g., x1 = +10 and x2 = -10?
At any rate, here is a solution that works with your provided data to return the data.table you specified:
my_fun <- function(data, ...){
dots <- list(...)
cbind(data,
a = data[[dots$x]] + 10,
b = data[[dots$y]] - 10
)
}
dat |>
my_fun(x = "x1", y = "x2")
id x1 x2 a b
1: A 0.8485309 -0.3532837 10.848531 -10.353284
2: A 0.7248478 -1.6561564 10.724848 -11.656156
3: A -1.3629114 0.4210139 8.637089 -9.578986
4: B -1.7934827 0.6717033 8.206517 -9.328297
5: B -1.0971890 -0.3008422 8.902811 -10.300842
6: B 0.4396630 -0.7447419 10.439663 -10.744742
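For the generic case, one option is to have the function return a data frame. This is a minimal sketch, assuming dplyr 1.0.0 or later on a plain tibble, where mutate unpacks an unnamed data-frame result into separate columns. The helper name my_fun_df and the object dat_df are hypothetical, and whether the same call translates through dtplyr's lazy_dt has not been checked here.

```r
library(dplyr)

# Hypothetical helper: returns a tibble, so it can carry any number of columns
my_fun_df <- function(x, y) {
  tibble(a = x + 10, b = y - 10)
}

set.seed(1)
dat_df <- tibble(id = rep(c("A", "B"), each = 3), x1 = rnorm(6), x2 = rnorm(6))

res <- dat_df |>
  group_by(id) |>
  # The unnamed data-frame result is unpacked into columns a and b
  mutate(my_fun_df(x = x1, y = x2)) |>
  ungroup()
```

Because the column names come from the returned tibble, the calling code does not change when the function returns more or fewer columns.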

Related

using rlang to select the entire dataframe and not just one column

I am trying to create a custom function where some operation is carried out only on one column of the dataframe. But I want the function to work in such a way that it will output not only the column on which the operation was carried out, but also the entire dataframe from which that particular column was drawn. This is a very simple example of what I want to achieve:
# libraries needed
library(dplyr)
library(rlang)
# creating a dataframe
data <-
as.data.frame(cbind(
x = rnorm(5),
y = rnorm(5),
z = rnorm(5)
))
# defining the custom function
custom.fn <- function(data, x) {
df <- dplyr::select(.data = data,
x = !!rlang::enquo(x)) # how can I also retain all the other columns in the dataframe apart from x?
df$x <- df$x / 2
return(df)
}
# calling the function (also want y and z here in the ouput)
custom.fn(data = data, x = x)
#> x
#> 1 0.49917536
#> 2 -0.03373202
#> 3 -1.24845349
#> 4 -0.15809688
#> 5 0.11237030
Created on 2018-02-14 by the reprex package (v0.1.1.9000).
Just specify the columns you want to include in your select call:
custom.fn <- function(data, x) {
df <- dplyr::select(.data = data,
x = !!rlang::enquo(x), y, z)
df$x <- df$x / 2
return(df)
}
If you don't want to name the rest of the columns explicitly, you can also use everything:
df <- dplyr::select(.data = data, x = !!rlang::enquo(x), dplyr::everything())
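If the goal is only to transform one column and keep the rest, the select step can be skipped entirely. This is a minimal sketch, assuming rlang 0.4.0 or later for the {{ }} operator; custom_fn2 and df_demo are hypothetical names. Unlike the versions above, it keeps the column's original name instead of renaming it to x.

```r
library(dplyr)

# Hypothetical variant of custom.fn: halve one column, keep all the others
custom_fn2 <- function(data, x) {
  dplyr::mutate(data, {{ x }} := {{ x }} / 2)
}

df_demo <- data.frame(x = c(2, 4, 6), y = 1:3, z = 4:6)
out <- custom_fn2(df_demo, x)
```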

perform operation on all rows and add results back to main data frame

I have a rather large dataset (15.000 rows) and I need to make calculations on each row due to the data structure. There is one column in my data set, that needs to be split further. Below is an example:
date <- c("2015-07-10", "2013-05-06", "2017-08-10")
Number <- c(345, 231, 10)
Route <- c("GCLP:10011:-8848:56:-4:270:260:12;LPC:1211:-828:56:-2:22:220:22;GCCC:13451:-85458:556:-45:45:76:67", "DPAP:10011:-8848:56:-4:270:260:12;LTTC:1211:-828:56:-2:22:220:22;ATCH:13451:-85458:556:-45:45:76:67", "AMN:10011:-8848:56:-4:270:260:12;RET:1211:-828:56:-2:22:220:22;LLOP:13451:-85458:556:-45:45:76:67")
Dep <- c("FGC","HAM","ICAO")
Plan <- data.frame(date, Number, Route, Dep)
For me the important information is in the column "Route". I need to generate aggregated features from this column. The information in each cell of the column needs to be split by the ";".
What I tried so far:
select one row
create a new data frame just with this one row.
use mutate and unnest on the column "Route" to split it at the ";" points and create a new row for each
test <- Plan[1,]
test <- test %>% mutate(Route=strsplit(as.character(Route), ";")) %>% unnest(Route)
use cSplit to split the information in the column "Route" by the ":"
test = cSplit(test, "Route", ":")
I then perform my calculations on this subset of the data.
I create variables x,y,z to save my calculations
x1 <- mean(test$Route_2)
y1 <- max(test$Route_5)
z1 <- min(test$Route_8)
The TWO QUESTIONS:
How can I automate this operation for all rows in my original dataset?
How do I merge the data in the saved variables (x, y, z) back into my original data frame?
DESIRED OUTPUT
(these are not the actual values from the data for x2 and x3, just an example)
x1 <- 12
y1 <- 86363
z1 <- 7383
x2 <- 45
y2 <- 6754
z2 <- 3553
x3 <- 5648
y3 <- 64
z3 <- 6363
Plan$x <- c(x1,x2,x3)
Plan$y <- c(y1, y2, y3)
Plan$z <- c(z1,z2,z3)
head(Plan)
FULL SAMPLE CODE ALL AT ONCE
library(splitstackshape)
library(plyr)
library(tidyr)
date <- c("2015-07-10", "2013-05-06", "2017-08-10")
Number <- c(345, 231, 10)
Route <- c("GCLP:10011:-8848:56:-4:270:260:12;LPC:1211:-828:56:-2:22:220:22;GCCC:13451:-85458:556:-45:45:76:67", "DPAP:10011:-8848:56:-4:270:260:12;LTTC:1211:-828:56:-2:22:220:22;ATCH:13451:-85458:556:-45:45:76:67", "AMN:10011:-8848:56:-4:270:260:12;RET:1211:-828:56:-2:22:220:22;LLOP:13451:-85458:556:-45:45:76:67")
Dep <- c("FGC","HAM","ICAO")
Plan <- data.frame(date, Number, Route, Dep)
test <- Plan[1,]
test <- test %>% mutate(Route=strsplit(as.character(Route), ";")) %>% unnest(Route)
test = cSplit(test, "Route", ":")
x1 <- mean(test$Route_2)
y1 <- max(test$Route_5)
z1 <- min(test$Route_8)
x2 <- 45
y2 <- 6754
z2 <- 3553
x3 <- 5648
y3 <- 64
z3 <- 6363
Plan$x <- c(x1,x2,x3)
Plan$y <- c(y1, y2, y3)
Plan$z <- c(z1,z2,z3)
head(Plan)
Copy Route into a temporary column, Route_tmp. Split Route_tmp on semicolons into one row per component, then separate each component on colons into individual columns. Grouping by the original variables, take the mean of the required columns. (If Route were not needed in the output, we could omit the mutate at the top and use Route in place of Route_tmp.)
library(dplyr)
library(tidyr)
out <- Plan %>%
mutate(Route_tmp = Route) %>%
separate_rows(Route_tmp, sep = ";") %>%
# sep = ":" keeps the minus signs; the default separator would also split on "-"
separate(Route_tmp, as.character(1:8), sep = ":", convert = TRUE) %>%
group_by(date, Number, Route, Dep) %>%
summarize(x = mean(`2`), y = mean(`5`), z = mean(`8`)) %>%
ungroup
giving the following (we do not show the Route column to make it easier to read):
> out[-3]
# A tibble: 3 × 6
date Number Dep x y z
<fctr> <dbl> <fctr> <dbl> <dbl> <dbl>
1 2013-05-06 231 HAM 8224.333 -17 33.66667
2 2015-07-10 345 FGC 8224.333 -17 33.66667
3 2017-08-10 10 ICAO 8224.333 -17 33.66667
Note: Since Plan is overwritten in the question it was not clear to me precisely which version of Plan was the input but I have assumed this:
Plan <- data.frame(date = c("2015-07-10", "2013-05-06", "2017-08-10"),
Number = c(345, 231, 10),
Route = c("GCLP:10011:-8848:56:-4:270:260:12;LPC:1211:-828:56:-2:22:220:22;GCCC:13451:-85458:556:-45:45:76:67", "DPAP:10011:-8848:56:-4:270:260:12;LTTC:1211:-828:56:-2:22:220:22;ATCH:13451:-85458:556:-45:45:76:67", "AMN:10011:-8848:56:-4:270:260:12;RET:1211:-828:56:-2:22:220:22;LLOP:13451:-85458:556:-45:45:76:67"),
Dep = c("FGC","HAM","ICAO"))
Here's how I'd do it using tidyverse packages:
library(dplyr)
library(tidyr)
library(stringr)
library(purrr)
# This function takes a single item from Plan$Route, splits it into its
# relevant columns and then finds the mean of columns 2, 5 and 8.
route_extract <- function(route) {
cols <- str_split(route, fixed(":"), simplify = TRUE)[, c(2, 5, 8), drop = FALSE]
# Converts the matrix to numeric without losing dimensions
storage.mode(cols) <- "numeric"
# Calculate the column means and then return the result as a `tibble`
cm <- colMeans(cols)
tibble(x = cm[1], y = cm[2], z = cm[3])
}
route_calc <- function(routes) {
str_split(routes, fixed(";")) %>%
map_df(route_extract)
}
Plan <- bind_cols(Plan, route_calc(Plan$Route))
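The question asks for a mean, a max and a min rather than three means, so here is a minimal base R sketch of that variant. The helper route_stats and the shortened Plan_demo data are hypothetical (the second route is made up for illustration), and the sketch assumes every ";"-separated segment has the same number of ":"-separated fields.

```r
# Hypothetical helper: summarise one Route string with base R only
route_stats <- function(route) {
  segments <- strsplit(route, ";", fixed = TRUE)[[1]]
  # One row per segment, one column per ":"-separated field
  fields <- do.call(rbind, strsplit(segments, ":", fixed = TRUE))
  c(x = mean(as.numeric(fields[, 2])),
    y = max(as.numeric(fields[, 5])),
    z = min(as.numeric(fields[, 8])))
}

Plan_demo <- data.frame(
  Dep = c("FGC", "HAM"),
  Route = c("GCLP:10011:-8848:56:-4:270:260:12;LPC:1211:-828:56:-2:22:220:22",
            "DPAP:10:-1:5:-7:2:2:3;LTTC:20:-2:6:-9:2:2:1"),
  stringsAsFactors = FALSE
)

# Apply to every row, then bind the x, y, z columns back onto the data frame
stats <- t(vapply(Plan_demo$Route, route_stats, numeric(3), USE.NAMES = FALSE))
Plan_demo <- cbind(Plan_demo, stats)
```

Splitting on a fixed ":" keeps negative values such as -4 intact, which matters for the max and min.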

dplyr rename_ produces error [duplicate]

dplyr's rename functions require the new column name to be passed in as unquoted variable names. However I have a function where the column name is constructed by pasting a string onto an argument passed in and so is a character string.
For example say I had this function
myFunc <- function(df, col){
new <- paste0(col, '_1')
out <- dplyr::rename(df, new = old)
return(out)
}
If I run this
df <- data.frame(a = 1:3, old = 4:6)
myFunc(df, 'x')
I get
a new
1 1 4
2 2 5
3 3 6
Whereas I want the 'new' column to be the name of the string I constructed ('x_1'), i.e.
a x_1
1 1 4
2 2 5
3 3 6
Is there any way of doing this?
I think this is what you were looking for. It uses rename_ as @Henrik suggested, but the argument has a, let's say, interesting name:
> myFunc <- function(df, col){
+ new <- paste0(col, '_1')
+ out <- dplyr::rename_(df, .dots=setNames(list(col), new))
+ return(out)
+ }
> myFunc(data.frame(x=c(1,2,3)), "x")
x_1
1 1
2 2
3 3
>
Note the use of setNames to use the value of new as name in the list.
Recent versions of dplyr (1.0.0 and later) provide the rename_with function.
Say you have a data frame:
library(tidyverse)
df <- tibble(V0 = runif(10), V1 = runif(10), V2 = runif(10), key=letters[1:10])
And you want to change all of the "V" columns. Usually, my reference for columns like this comes from a JSON file, which in R becomes a named list; a named character vector stands in for it here, e.g.,
colmapping <- c("newcol1", "newcol2", "newcol3")
names(colmapping) <- paste0("V",0:2)
You can then use the following to change the names of df to the strings in colmapping:
df <- rename_with(.data = df, .cols = starts_with("V"), .fn = function(x){colmapping[x]})
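For the original single-column case, where the new name is built from a string, a named character vector passed through all_of() also works with plain rename. This is a minimal sketch assuming a current dplyr; my_func2 is a hypothetical name for a rewrite of the question's function.

```r
library(dplyr)

# Hypothetical rewrite of myFunc: rename column `col` to "<col>_1"
my_func2 <- function(df, col) {
  new <- paste0(col, "_1")
  # all_of() takes a named character vector: names are new, values are old
  dplyr::rename(df, dplyr::all_of(setNames(col, new)))
}

res <- my_func2(data.frame(a = 1:3, x = 4:6), "x")
names(res)
```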

Filter data table by dynamic column name

Let's say I have a data.table with columns A, B and C.
I'd like to write a function that applies a filter (for example A>1), but "A" needs to be dynamic (the function's parameter): if I pass A, it does A>1; if I pass B, it does B>1, and so on (A and B always being column names, of course).
Example:
Let's say my data is below. I'd like to do "A==1" and have it return the green line, or do "B==1 & C==1" and have it return the blue line.
Can this be done?
thanks
You can try
f1 <- function(dat, colName){dat[eval(as.name(colName))>1]}
setDT(df1)
f1(df1, 'A')
f1(df1, 'B')
If you need to make the value also dynamic
f2 <- function(dat, colName, value){dat[eval(as.name(colName))>value]}
f2(df1, 'A', 1)
f2(df1, 'A', 5)
data
set.seed(24)
df1 <- data.frame(A=sample(-5:10, 20, replace=TRUE),
B=rnorm(20), C=LETTERS[1:20], stringsAsFactors=FALSE)
Try:
dt = data.table(A=c(1,1,2,3,1), B=c(4,5,1,1,1))
f=function(dt, colName) dt[dt[[colName]]>1,]
#> f(dt, 'A')
# A B
#1: 2 1
#2: 3 1
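A variant of the same idea uses get() to look the column up by name inside the data.table. This is a minimal sketch; filter_gt, col_name, threshold and dt_demo are hypothetical names, and it assumes the table has no columns called col_name or threshold, since those would shadow the function arguments.

```r
library(data.table)

# Hypothetical helper: keep rows where the named column exceeds a threshold
filter_gt <- function(dat, col_name, threshold) {
  dat[get(col_name) > threshold]
}

dt_demo <- data.table(A = c(1, 1, 2, 3, 1), B = c(4, 5, 1, 1, 1))
res_a <- filter_gt(dt_demo, "A", 1)  # rows where A > 1
res_b <- filter_gt(dt_demo, "B", 1)  # rows where B > 1
```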
If your data is
a <- c(1:9)
b <- c(10:18)
# create a data.frame
df <- data.frame(a,b)
# or a data.table
dt <- data.table(a,b)
you can store your condition(s) in a variable x
x <- quote(a >= 3)
and filter the data.frame using dplyr, unquoting the stored expression with !! (passing x itself, or subsetting with df[x, ], won't work)
library(dplyr)
filter(df, !!x)
or using data.table as suggested by @Frank
library(data.table)
dt[eval(x),]
Why write a function? You can do this...
Specifically:
d.new=d[d$A>1,]
where d is the dataframe, d$A is the variable, and d.new is a new dataframe.
More generally:
data=d #data frame
variable=d$A #variable
minValue=1 #minimum value
d.new=data[variable>minValue,] #create new data frame (d.new) filtered by min value
To create a new column:
If you don't want to actually create a new dataframe but want to create an indicator variable you can use ifelse. This is most similar to coloring rows as shown in your example. Code below:
d$indicator1=ifelse(d$X1>0,1,0)

How to use ddply to get weighted-mean of class in dataframe?

I'm new to plyr and want to take the weighted mean of values within a class to reshape a dataframe for multiple variables. Using the following code, I know how to do this for one variable, such as x2:
set.seed(123)
frame <- data.frame(class=sample(LETTERS[1:5], replace = TRUE),
x=rnorm(20), x2 = rnorm(20), weights=rnorm(20))
ddply(frame, .(class),function(x) data.frame(weighted.mean(x$x2, x$weights)))
However, I would like the code to create a new data frame for x and x2 (and any number of variables in the frame). Does anybody know how to do this? Thanks
You might find what you want in the ?summarise function. I can replicate your code with summarise as follows:
library(plyr)
set.seed(123)
frame <- data.frame(class=sample(LETTERS[1:5], replace = TRUE), x=rnorm(20),
x2 = rnorm(20), weights=rnorm(20))
ddply(frame, .(class), summarise,
x2 = weighted.mean(x2, weights))
To do this for x as well, just add that line to be passed into the summarise function:
ddply(frame, .(class), summarise,
x = weighted.mean(x, weights),
x2 = weighted.mean(x2, weights))
Edit: If you want to do an operation over many columns, use colwise or numcolwise instead of summarise, or do summarise on a melted data frame with the reshape2 package, then cast back to original form. Here's an example using colwise:
wmean.vars <- c("x", "x2")
ddply(frame, .(class), function(x)
colwise(weighted.mean, w = x$weights)(x[wmean.vars]))
Finally, if you don't like having to specify wmean.vars, you can also do:
ddply(frame, .(class), function(x)
numcolwise(weighted.mean, w = x$weights)(x[!colnames(x) %in% "weights"]))
which will compute a weighted-average for every numerical field, excluding the weights themselves.
A data.table answer for fun, which also doesn't require specifying all the variables individually.
library(data.table)
frame <- as.data.table(frame)
keynames <- setdiff(names(frame),c("class","weights"))
frame[, lapply(.SD,weighted.mean,w=weights), by=class, .SDcols=keynames]
Result:
class x x2
1: B 0.1390808 -1.7605032
2: D 1.3585759 -0.1493795
3: C -0.6502627 0.2530720
4: E 2.6657227 -3.7607866
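The same calculation can also be written with dplyr's across(). This is a minimal sketch assuming dplyr 1.0.0 or later; frame_demo is a hypothetical stand-in for the question's data, with positive runif() weights so that the weighted means are well behaved.

```r
library(dplyr)

set.seed(123)
frame_demo <- data.frame(class = rep(LETTERS[1:4], 5),
                         x = rnorm(20), x2 = rnorm(20),
                         weights = runif(20))  # positive weights, unlike the rnorm() weights above

res <- frame_demo %>%
  group_by(class) %>%
  # Weighted mean of each listed column, using the current group's weights
  summarise(across(c(x, x2), ~ weighted.mean(.x, weights)))
```

To cover every column except the weights, the selection c(x, x2) can be replaced with -weights.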
