R pipe operator for indexing - r

I want to have the same kind of pipe operator as dplyr's %>%, but for indexing instead of passing functions. For example, I want to define a function %l% such that, at least for vectors and matrices,
1 %l% df would be equivalent to df[1] (vectors) or df[1,] (matrices).
abc <- c("a","b", "c")
def <- c("d","e", "f")
df <- data.frame(abc, def, stringsAsFactors = FALSE)
1 %l% df

You could do this, for example. Subsetting rows and columns for both vectors and matrices / data.frames.
`%l%` <- lineindex <- function(x, y) {
  result <- NA
  # || short-circuits, so ncol(y) is only evaluated when y has dimensions;
  # the vectorised | and & give a zero-length condition for plain vectors
  if (is.null(dim(y)) || ncol(y) == 1) {
    result <- y[x]
  } else if (length(dim(y)) == 2) {
    result <- y[x, ]
  }
  return(result)
}
`%c%` <- colindex <- function(x, y) {
  result <- NA
  if (is.null(dim(y)) || ncol(y) == 1) {
    result <- y[x]
  } else if (length(dim(y)) == 2) {
    result <- y[, x]
  }
  return(result)
}
c(1,NA,2) %>% is.na() %>% which() %l% df
abc def
2 b e
c(1,NA,2) %>% is.na() %>% which() %c% df
[1] "d" "e" "f"
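A minimal base R sketch of the same idea, checked on a plain vector and a matrix rather than a data frame. The operator body is a condensed restatement of the answer above, and the objects `v` and `m` are made up for illustration.

```r
# Condensed, hypothetical restatement of the row-index operator:
# index on the left, data on the right.
`%l%` <- function(x, y) {
  if (is.null(dim(y)) || ncol(y) == 1) {
    y[x]        # vectors and single-column objects
  } else {
    y[x, ]      # matrices and data frames: select rows
  }
}

v <- c(10, 20, 30)            # illustrative vector
m <- matrix(1:6, nrow = 3)    # illustrative 3 x 2 matrix

vec_result <- 2 %l% v         # same as v[2]
mat_result <- 2 %l% m         # same as m[2, ]
```

Because every `%any%` operator has the same precedence and associates left to right, a chain such as `x %>% which() %l% df` is evaluated as `(x %>% which()) %l% df`.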

Related

Function argument as value or column name for data.table

I want my function to be able to take a value or a column name. How can I do this with data.table?
library(data.table)
df <- data.table(a = c(1:5),
b = c(5:1),
c = c(1, 3, 5, 3, 1))
myfunc <- function(val) {
df[a >= val]
}
# This works:
myfunc(2)
# This does not work:
myfunc("c")
If I define my function as:
myfunc <- function(val) {
df[a >= get(val)]
}
# This doesn't work:
myfunc(2)
# This works:
myfunc("c")
What is the best way to resolve this?
Edit: To be clear, I want the results to be the same as:
# myfunc(2)
df %>%
filter(a >= 2)
# myfunc("c")
df %>%
filter(a >= c)
EDIT:
Thanks all for the responses, I think I like dww's answer the best.
I wish it was as easy as in dplyr, where I can do:
myfunc <- function(val) {
df %>%
filter(a >= {{val}})
}
# Both work:
myfunc(2)
myfunc(c)
If you build and parse the whole expression, then you can evaluate it in its entirety. For example
myfunc <- function(val) {
df[eval(parse(text=paste("a >= ", val)))]
}
Relying on a function that lets you mix values and variable names in the same parameter might be dangerous, though, especially if you actually want to match on character values rather than variable names. If you pass in the whole expression instead, you could do
myfunc <- function(expr) {
expr <- substitute(expr)
df[eval(expr)]
}
myfunc(a>=3)
myfunc(a>=c)
The question did not actually define the desired behavior, so we assume that df must be a data.table. If a character string is passed, the column of that name is returned; if a number is passed, the rows whose a column exceeds that number are returned.
Define an S3 generic and methods for character and default.
myfunc <- function(x, data = df) UseMethod("myfunc")
myfunc.character <- function(x, data = df) data[[x]]
myfunc.default <- function(x, data = df) data[a > x]
myfunc(2)
## a b c
## 1: 3 3 5
## 2: 4 2 3
## 3: 5 1 1
myfunc("c")
## [1] 1 3 5 3 1
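Another way to get the filtering behavior the question asks for is to resolve the argument to a vector before it reaches the data.table expression. The sketch below is only one possible reading of the requirement: it assumes a character argument always means a column name, and the helper variable `threshold` is an invented name that must not clash with a column of the table.

```r
library(data.table)

df <- data.table(a = 1:5, b = 5:1, c = c(1, 3, 5, 3, 1))

# Hypothetical variant of myfunc: a character argument is treated as a column
# name, anything else as a literal threshold.
myfunc <- function(val, data = df) {
  threshold <- if (is.character(val)) data[[val]] else val
  data[a >= threshold]
}

by_value  <- myfunc(2)     # rows where a >= 2
by_column <- myfunc("c")   # rows where a >= c
```

As noted above, this design cannot filter on a literal character value, since every string is interpreted as a column name.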

save an output into a vector in R?

I have the following loop, which finds the distinct number of cases belonging to 4 different factors. test is a list containing 4 data frames.
for (i in test){
i<-i%>%distinct(FileNumber)%>%nrow()
print(i)
}
When I run this, I get the following output:
[1] 38
[1] 129
[1] 1868
[1] 277
However I want this output to be saved into another vector called my_vector. So that my_vector is
38 129 1868 277
So I tried the following based on this answer I found
Saving results from for loop as a vector in r
library(dplyr)
my_vector<-vector("numeric",4L)
for (i in test){
my_vector[i]<-i%>%distinct(FileNumber)%>%nrow()
}
However when I run this I get the following message
Error in my_vector[i] <- i %>% distinct(FileNumber) %>% nrow() :
invalid subscript type 'list'
How do I get the earlier output I listed saved into a vector?
You are trying to index my_vector with a list-like object.
For instance:
mylist <- list(mtcars, mtcars)
myvec <- numeric(length(mylist))
for (i in mylist) {
myvec[i] <- nrow(distinct(i, cyl))
}
On the first (and second in this example) iteration, i is a frame, so myvec[i] is equivalent to myvec[mtcars], which does not make sense.
Instead, loop over the indices of the list of frames, for example:
library(dplyr)
mylist <- list(mtcars, mtcars)
myvec <- numeric(length(mylist))
for (i in seq_along(mylist)) {
  myvec[i] <- mylist[[i]] %>% distinct(cyl) %>% nrow()
}
myvec
# [1] 3 3
or just do something like:
sapply(mylist, function(l) l %>% distinct(cyl) %>% nrow())
# [1] 3 3
BTW: this is just as easy in base-R with:
sapply(mylist, function(l) length(unique(l[["cyl"]])))
# [1] 3 3
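The same base R idea can be written with vapply(), which also checks that every element yields exactly one integer. This is a sketch on made-up data: the list `test` below is a stand-in for the asker's list of data frames, with only the `FileNumber` column taken from the question.

```r
# Illustrative stand-in for the asker's list of data frames
test <- list(
  data.frame(FileNumber = c(1, 1, 2)),
  data.frame(FileNumber = c(3, 4, 5, 5))
)

# One distinct count per data frame, collected directly into a vector
my_vector <- vapply(
  test,
  function(d) length(unique(d$FileNumber)),
  integer(1)
)
```

Unlike sapply(), vapply() fails loudly if a list element produces something other than a single integer, which can help when the list contents are not fully under your control.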
This should work with a list of data frames or matrices
d <- list(a = matrix(rnorm(100), nrow = 20),
b = matrix(rnorm(100), nrow = 10),
c = matrix(rnorm(100), nrow = 50))
my_vect <- c()
for (i in seq_along(d)){
n <- nrow(d[[i]])
my_vect[i] <- n
}
my_vect
[1] 20 10 50
Use unlist() and if that doesn't work, then add as.vector() in your pipe:
for (i in test){
i<-i %>% distinct(FileNumber) %>% nrow() %>% unlist()
print(i)
}
If that does not come out as a vector then:
for (i in test){
i<-i %>% distinct(FileNumber) %>% nrow() %>% unlist() %>% as.vector()
print(i)
}

Operations on the columns with the same names in two data set in R

I have two datasets as follows:
df1 <- data.frame(a =c(1), b=c(4), c=c(1))
df2 <- data.frame (b =c(4), c=c(1), a=c(4))
I wish to do operations on the columns with the same name: for example for column a, I want to do this operation (a(in df1) - a (in df2)) / a (in df1)
that is (1-4) /1 = -3
So my ideal output will be:
a b c
-3 0 0
I'll try to write a function as follows but not sure how to proceed. Any tip is really appreciated.
my_func <- function(x,y) {
for (i in names(x))
if ((i %in% names(y))) {
df3 [i,] <- (x[i,] - y[i,]) / x[i,]
}
}
Update
Ideally, it would be great if I could account for missing columns. For example, if a column is missing in df2, I wish to have 0, and if the column is missing in df1, the output can be anything like "no value".
Building on @Headpoint's answer, you can do this directly (no loops needed):
df1 <- data.frame(a = c(1), b = c(4), c = c(1))
df1 <- df1[, order(names(df1))]
df2 <- data.frame(b = c(4), c = c(1), a = c(4))
df2 <- df2[, order(names(df2))]
# intersect() keeps only the columns present in both data frames
all_col_names <- intersect(colnames(df1), colnames(df2))
(df1[, all_col_names] - df2[, all_col_names]) / df1[, all_col_names]
a b c
1 -3 0 0
Is this what you are after?
res <- NULL
for (str in colnames(df1))
res <- c(res, (df1[str] - df2[str]) / df1[str] )
res
#$a
#[1] -3
#$b
#[1] 0
#$c
#[1] 0
If you want a numeric vector instead:
out <- as.numeric(res)
names(out) <- names(res)
out
# a b c
#-3 0 0
If the columns do not match...
col_nam1 <- colnames(df1)
col_nam2 <- colnames(df2)
all_col_names <- unique(c(col_nam1, col_nam2))
res <- NULL
for (str in all_col_names)
if ((str %in% col_nam1) && (str %in%col_nam2))
res <- c(res, (df1[str] - df2[str]) / df1[str])
Inspired by @Hardik gupta, doing it without loops:
common_names <- sort(intersect(col_nam2, col_nam1))
(df1[, common_names] - df2[, common_names]) / df1[, common_names]
Zeros where df1 has a column that df2 doesn't:
df1 <- data.frame(a =c(1), b=c(4), c=c(1), f = 4)
df2 <- data.frame (b =c(4), c=c(1), a=c(4), g = 5)
col_nam1 <- colnames(df1)
col_nam2 <- colnames(df2)
common_names <- intersect(col_nam2, col_nam1)
col_names <- sort(union(col_nam1, common_names))
res <- numeric(length(col_names))
names(res) <- col_names
res[common_names] <- (df1[, common_names] - df2[, common_names]) /
df1[, common_names]
out <- as.numeric(res)
names(out) <- names(res)
out
a b c f
-3 0 0 0
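The pieces above can be combined into one function that also covers the update in the question. This is a sketch under two assumptions that are not stated in the thread: both data frames have a single row, and NA is an acceptable "no value" marker for columns that exist only in the second data frame. The name `relative_diff` is invented for illustration.

```r
df1 <- data.frame(a = 1, b = 4, c = 1, f = 4)
df2 <- data.frame(b = 4, c = 1, a = 4, g = 5)

# Hypothetical helper: (x - y) / x for shared columns, 0 for columns only in x,
# NA for columns only in y. Assumes single-row data frames.
relative_diff <- function(x, y) {
  all_names <- sort(union(names(x), names(y)))
  out <- setNames(rep(NA_real_, length(all_names)), all_names)
  out[setdiff(names(x), names(y))] <- 0
  common <- intersect(names(x), names(y))
  out[common] <- unlist((x[common] - y[common]) / x[common])
  out
}

result <- relative_diff(df1, df2)
```

Selecting columns by name on both sides (`x[common]`, `y[common]`) is what makes the differing column order in df1 and df2 irrelevant.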

How to structure a function in R whose input is a data.frame and the row by row output is conditional on the contents?

Consider the following table:
V1 V2 V3 V4
1 A X -0.2834111 -1.5095923
2 A X 0.3114088 -0.1706417
3 B Y 0.2544403 -0.4790589
4 B X 0.6209947 -1.8988974
5 C X 1.7428690 -0.2251725
I would like to write a function which spits out a calculation for each line, but the calculation depends on the contents of the various variables in that line. For example.
If V1 = A, Output f(V3,V4)
If V1 = B, Output g(V3,V4)
If V1 = C, Output 0
If V1 = B AND V2 = Y, Output h(V3,V4)
Where f,g,h are suitably vectorised functions. What is the best way to write a function that produces the vector of outputs calculated by a bunch of functions that depend on the rules and the contents of the columns in the data.frame.
Right now, I've got a wrapper function whose input is a data.frame, which then plugs the required columns in to the main function which calls the sub functions depending on the conditions.
eg:
foo_wrapper <- function(x){
foo(x$V1, x$V2, x$V3, x$V4)
}
And the main function is:
foo <- function(V1,V2,V3,V4){
#Define Functions
f <- function() ... (some vectorized function)
g <- function() ...
h <- function() ...
#Produce results
res <- NA
res <- ifelse(V1 == "A", f(V3,V4), res)
res <- ifelse(V1 == "C", 0, res)
res <- ifelse(V1 == "B" & V2 != "Y", g(V3,V4), res)
res <- ifelse(V1 == "B" & V2 == "Y", h(V3,V4), res)
return(res)
}
It's slow and I'm sure there's a much better way.
Any insight would be greatly appreciated.
EDIT: Let's say f,g,h are:
f <- function(){
V3*V4
}
g <- function(){
pmax(V3,V4)
}
h <- function(){
exp(-1*V3)/(y+V4)
}
Here is a possible optimization, but without more real data there is no way to know whether it helps.
my_df <- read.table(header=TRUE, text=
"V1 V2 V3 V4
A X -0.2834111 -1.5095923
A X 0.3114088 -0.1706417
B Y 0.2544403 -0.4790589
B X 0.6209947 -1.8988974
C X 1.7428690 -0.2251725")
## define functions outside the foo function - perhaps continual redefinition is slow
## use paste as a fake definition for testing
f <- function(x,y) {paste("f",x,y)}
g <- function(x,y) {paste("g",x,y)}
h <- function(x,y) {paste("h",x,y)}
# define the function to be applied
foo <- function(item){
#Produce results, nested ifelse avoids reevaluation
res <- ifelse(item['V1'] == "A", f(item['V1'],item['V2']),
ifelse(item['V1'] == "C", 0,
ifelse(item['V1'] == "B" & item['V2'] != "Y", g(item['V3'],item['V4']),
ifelse(item['V1'] == "B" & item['V2'] == "Y", h(item['V3'],item['V4']),
NA))))
return(res)
}
apply(my_df, 1, foo)
[1] "f A X" "f A X" "h 0.2544403 -0.4790589" "g 0.6209947 -1.8988974"
[5] "0"
The ifelse() function isn't known for being very fast. Direct indexing is typically faster:
foo <- function(V1,V2,V3,V4){
  #Define Functions
  # f follows the question's edit (V3 * V4), so res stays numeric
  f <- function(x, y) x*y
  g <- function(x, y) pmax(x,y)
  h <- function(x, y) exp(-1*x)/(y+4)
  #Produce results
  res <- rep(0, length(V1))
  idx <- V1 == "A"
  res[idx] <- f(V3[idx],V4[idx])
  idx <- V1 == "B" & V2 != "Y"
  res[idx] <- g(V3[idx],V4[idx])
  idx <- V1 == "B" & V2 == "Y"
  res[idx] <- h(V3[idx],V4[idx])
  return(res)
}
This should minimize the number of calculations.
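To make the indexing approach concrete, here is a sketch that runs it on the sample table from the question. The definitions of f and g follow the question's edit; h is an assumption, since the question's version refers to an undefined `y`, so `exp(-x) / y` is used here purely for illustration.

```r
my_df <- data.frame(
  V1 = c("A", "A", "B", "B", "C"),
  V2 = c("X", "X", "Y", "X", "X"),
  V3 = c(-0.2834111, 0.3114088, 0.2544403, 0.6209947, 1.7428690),
  V4 = c(-1.5095923, -0.1706417, -0.4790589, -1.8988974, -0.2251725)
)

f <- function(x, y) x * y
g <- function(x, y) pmax(x, y)
h <- function(x, y) exp(-x) / y   # assumed form, not confirmed by the question

res <- rep(0, nrow(my_df))        # rows with V1 == "C" keep the default 0

idx <- my_df$V1 == "A"
res[idx] <- f(my_df$V3[idx], my_df$V4[idx])

idx <- my_df$V1 == "B" & my_df$V2 != "Y"
res[idx] <- g(my_df$V3[idx], my_df$V4[idx])

idx <- my_df$V1 == "B" & my_df$V2 == "Y"
res[idx] <- h(my_df$V3[idx], my_df$V4[idx])
```

Each function is called once, and only on the rows it applies to, whereas chained ifelse() calls evaluate every branch on all rows.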
You should also consider this:
Assumption: df is the data frame to be considered.
library(data.table)
setDT(df)
test <- function(x){
if (x$V1[1] == 'A')
return (f(x$V3,x$V4))
else if (x$V1[1] == 'C')
return (rep(0,nrow(x)))
else if (x$V1[1] == 'B' && x$V2[1] == 'Y')
return (h(x$V3,x$V4))
else
return (g(x$V3,x$V4))
}
df[,test(.SD),by=c('V1','V2'),.SDcols = colnames(df)]
I felt like being very explicit and human-readable today for some reason. Here's my solution:
## data
df <- data.frame(
  V1=c('A','A','B','B','C'),
  V2=c('X','X','Y','X','X'),
  V3=c(-0.2834111,0.3114088,0.2544403,0.6209947,1.7428690),
  V4=c(-1.5095923,-0.1706417,-0.4790589,-1.8988974,-0.2251725),
  stringsAsFactors=FALSE
);
## map of functions
funs <- list(
zero=function(x,y) 0,
mult=function(x,y) x*y,
exp=function(x,y) exp(-1*x)/y,
pmax=function(x,y) pmax(x,y)
);
## encapsulate logic that transforms V1,V2 space to function space
vgrp.to.fungrp <- function(V1,V2)
ifelse(V1=='A','mult',
ifelse(V1=='C','zero',
ifelse(V1=='B',
ifelse(V2=='Y','exp','pmax'),
'error'
)
)
);
## run it to get function grouping
fungrps <- vgrp.to.fungrp(df$V1,df$V2);
fungrps;
## [1] "mult" "mult" "exp" "pmax" "zero"
## use ave() to run each represented function once for the set of rows that map to it
ave(seq_len(nrow(df)),fungrps,FUN=function(ri) funs[[fungrps[ri[1L]]]](df$V3[ri],df$V4[ri]));
## [1] 0.42783521 -0.05313933 -1.61848645 0.62099470 0.00000000

R, pass column name as argument to function using dplyr::filter() and %in%

How can I pass a column name in a function similar to the question here but using dplyr chaining and filter() together with %in%.
require(dplyr)
set.seed(8)
df <- data.frame(
A=sample(c(1:3), 10, replace=T),
B=sample(c(1:3), 10, replace=T))
If I want to get the rows where column A is 1 or 2, I can do:
df %>% filter(A %in% c(1,2))
I get:
A B
1 2 3
2 1 2
3 1 3
4 2 1
5 1 1
6 1 3
Now, how can I put this in a function where one can specify the column? This does not work:
fun1 <- function(x, column, n){
res <-
x %>% filter(column %in% n)
return(res)
}
fun1(df, A, c(1,2))
You could try
fun1 <- function(x, column, n){
x %>%
filter_(lazyeval::interp(quote(x %in% y), x=as.name(column), y=n))
}
fun1(df, 'A', 1:2)
Or
fun2 <- function(x, column, n){
args <- as.list(match.call())
x %>%
filter(eval(args$column, x) %in% n)
}
fun2(df, A, 1:2)
If you want to keep your function, try:
fun1 <- function(x, column, n){
res <- x %>% filter_(paste(column,"%in%",n))
return(res)
}
fun1(df, "A", "c(1,2)")
Try changing your function to
fun1 <- function(x, column, n){
require(lazyeval)
filter_(x,
interp(quote(col %in% n),
col = lazy(column), n = n))
}
all(fun1(df, A, c(1, 2)) == filter(df, A %in% c(1,2)))
# TRUE
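The answers above rely on filter_() and lazyeval, which newer dplyr releases have deprecated. A sketch of the same function with tidy evaluation follows, assuming a dplyr version that supports the `{{ }}` operator and the `.data` pronoun. The function names `fun_bare` and `fun_string` and the small data frame are made up for illustration.

```r
library(dplyr)

# Illustrative data, not the seeded sample from the question
df <- data.frame(A = c(1, 2, 3, 1, 2), B = c(3, 2, 1, 1, 3))

# Column passed as a bare name: fun_bare(df, A, c(1, 2))
fun_bare <- function(x, column, n) {
  x %>% filter({{ column }} %in% n)
}

# Column passed as a string: fun_string(df, "A", c(1, 2))
fun_string <- function(x, column, n) {
  x %>% filter(.data[[column]] %in% n)
}

res_bare   <- fun_bare(df, A, c(1, 2))
res_string <- fun_string(df, "A", c(1, 2))
```

Which form to prefer depends on the caller: the bare-name version reads like ordinary dplyr code, while the string version is easier to drive from column names stored in variables.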

Resources