I want to create a function in R that will create a numerical column based on a character/categorical column. In order to do this I need to get the distinct values in the categorical column. I can do this fine outside a function, but would like to make a reusable function to do it. The issue I've run into is that the same distinct() call that works outside the function doesn't behave the same way inside the function. I've created a demo below:
# test of call to db to numericize
DF <- data.frame("a" = c("a","b","c","a","b","c"),
"b" = paste(0:5, ".1", sep = ""),
"c" = letters[1:6],
stringsAsFactors = FALSE)
catnum <- function(db, inputcolname) {
x <- distinct(db,inputcolname);
print(x);
return(x);
}
y <- distinct(DF,a)
y
catnum(DF,'a')
While y gives the correct distinct one-column answer (one column with (a,b,c) in it), x within the function is the entire dataframe. I have tried with and without the quotes, as in catnum(DF,a), but the results are the same.
Could someone tell me what is happening or suggest some code that would work?
One solution is to use the distinct_ function inside your function (note that the underscore verbs such as distinct_ are deprecated in newer versions of dplyr). distinct expects a bare column name and does not work with a column name stored in a variable.
For example, distinct(DF, "a") will not work. The actual syntax is distinct(DF, a). Notice the missing quotes. When distinct is called from inside the function, the column name is supplied through a variable (i.e. inputcolname), which is evaluated as an ordinary value rather than treated as a column name, hence the unexpected result. distinct_, on the other hand, accepts a column name held in a variable.
library(dplyr)
catnum <- function(db, inputcolname) {
x <- distinct_(db,inputcolname);
#print(x);
return(x);
}
#With modified function results were as expected.
catnum(DF,'a')
# a
# 1 a
# 2 b
# 3 c
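As a hedged aside for newer dplyr versions, where the underscore verbs are deprecated: a minimal sketch that keeps the column name as a string and converts it to a symbol before unquoting. The name catnum_chr is made up for illustration, and the sketch assumes the dplyr and rlang packages are installed.

```r
library(dplyr)

# Same toy data as in the question (column "c" omitted for brevity)
DF <- data.frame(a = c("a", "b", "c", "a", "b", "c"),
                 b = paste0(0:5, ".1"),
                 stringsAsFactors = FALSE)

# catnum_chr is a hypothetical name; inputcolname is a character string
catnum_chr <- function(db, inputcolname) {
  col <- rlang::sym(inputcolname)  # turn the string "a" into the symbol a
  distinct(db, !!col)              # unquote so distinct() sees a column
}

res <- catnum_chr(DF, "a")
print(res)
```

The call still takes a quoted name, catnum_chr(DF, "a"), so existing callers of the distinct_ version would not need to change.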
Not sure what you are trying to do or where the distinct function is coming from. Are you looking for this?
catnum<-function(DF,var){
length(unique(DF[[var]]))
}
catnum(DF,'a')
Your inputs are not the same, so you get different results. If you give distinct the same arguments you give catnum, you will get the same result:
isTRUE(all.equal(distinct(DF, a),
catnum(DF, "a")))
## [1] FALSE
isTRUE(all.equal(distinct(DF, "a"),
catnum(DF, "a")))
##[1] TRUE
Unfortunately, this does not work:
catnum(DF, a)
## a b c
## 1 a 0.1 a
## 2 b 1.1 b
## 3 c 2.1 c
## 4 a 3.1 d
## 5 b 4.1 e
## 6 c 5.1 f
The reason, as explained in
vignette("programming")
is that you must jump through several annoying hoops if you want to write functions that use functions from dplyr. The solution (as you will learn in the vignette) is as follows:
catnum <- function(db, inputcolname) {
inputcolname <- enquo(inputcolname)
distinct(db, !!inputcolname)
}
catnum(DF, a)
## a
## 1 a
## 2 b
## 3 c
Or you could conclude that this is all too confusing and do something like
catnum <- function(db, inputcolname) {
unique(db[, inputcolname, drop = FALSE])
}
catnum(DF, "a")
## a
## 1 a
## 2 b
## 3 c
instead.
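Since the stated end goal was a numerical column derived from a categorical one, here is a minimal base R sketch of that last step. The function name add_codes and the "_num" suffix are assumptions made for illustration, not something from the question; codes are assigned in order of first appearance.

```r
# add_codes is a hypothetical helper: it appends an integer code column
# for the character column named by inputcolname (a string).
add_codes <- function(db, inputcolname,
                      newcolname = paste0(inputcolname, "_num")) {
  values <- db[[inputcolname]]
  # match() gives each value the position of its first occurrence
  # among the distinct values, i.e. 1 for the first distinct value, etc.
  db[[newcolname]] <- match(values, unique(values))
  db
}

DF <- data.frame(a = c("a", "b", "c", "a", "b", "c"),
                 stringsAsFactors = FALSE)
out <- add_codes(DF, "a")
print(out)
```

If alphabetical codes are preferred over first-appearance order, as.integer(factor(values)) would be the usual alternative.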
I have a list of data.frames. I want to send each data.frame to a function using lapply. Inside the function I want to check whether the name of a data.frame includes a particular string. If the string in question is present I want to perform one series of operations. Otherwise I want to perform a different series of operations. I cannot figure out how to check whether the string in question is present from within the function.
I wish to use base R. This seems to be a possible solution but I cannot get it to work:
In R, how to get an object's name after it is sent to a function?
Here is an example list followed by an example function further below.
matrix.apple1 <- read.table(text = '
X3 X4 X5
1 1 1
1 1 1
', header = TRUE)
matrix.apple2 <- read.table(text = '
X3 X4 X5
1 1 1
2 2 2
', header = TRUE)
matrix.orange1 <- read.table(text = '
X3 X4 X5
10 10 10
20 20 20
', header = TRUE)
my.list <- list(matrix.apple1 = matrix.apple1,
matrix.orange1 = matrix.orange1,
matrix.apple2 = matrix.apple2)
This operation can check whether each object name contains the string apple,
but I am not sure how to use this information inside the function further below.
grepl('apple', names(my.list), fixed = TRUE)
#[1] TRUE FALSE TRUE
Here is an example function. Based on hours of searching and trial-and-error, I think I am supposed to use deparse(substitute(x)), but so far it only returns x or something similar.
table.function <- function(x) {
# The three object names are:
# 'matrix.apple1', 'matrix.orange1' and 'matrix.apple2'
myObjectName <- deparse(substitute(x))
print(myObjectName)
# perform a trivial example operation on a data.frame
my.table <- table(as.matrix(x))
# Test whether an object name contains the string 'apple'
contains.apple <- grep('apple', myObjectName, fixed = TRUE)
# Use the result of the above test to perform a trivial example operation.
# With my code 'my.binomial' is always given the value of 0 even though
# 'apple' appears in the name of two of the data.frames.
my.binomial <- ifelse(contains.apple == 1, 1, 0)
return(list(my.table = my.table, my.binomial = my.binomial))
}
table.function.output <- lapply(my.list, function(x) table.function(x))
These are the results of print(myObjectName):
#[1] "x"
#[1] "x"
#[1] "x"
table.function.output
Here are the rest of the results of table.function showing that my.binomial is always 0.
The first and third value of my.binomial should be 1 because the names of the first and third data.frames contain the string apple.
# $matrix.apple1
# $matrix.apple1$my.table
# 1
# 6
# $matrix.apple1$my.binomial
# logical(0)
#
# $matrix.orange1
# $matrix.orange1$my.table
# 10 20
# 3 3
# $matrix.orange1$my.binomial
# logical(0)
#
# $matrix.apple2
# $matrix.apple2$my.table
# 1 2
# 3 3
# $matrix.apple2$my.binomial
# logical(0)
You could redesign your function to use the list names instead:
table_function <- function(myObjectName) {
# The three object names are:
# 'matrix.apple1', 'matrix.orange1' and 'matrix.apple2'
myObject <- get(myObjectName)
print(myObjectName)
# perform a trivial example operation on a data.frame
my.table <- table(as.matrix(myObject))
# Test whether an object name contains the string 'apple'
contains.apple <- grep('apple', myObjectName, fixed = TRUE)
# Use the result of the above test to perform a trivial example operation.
# With my code 'my.binomial' is always given the value of 0 even though
# 'apple' appears in the name of two of the data.frames.
my.binomial <- +(contains.apple == 1)
return(list(my.table = my.table, my.binomial = my.binomial))
}
lapply(names(my.list), table_function)
This returns
[[1]]
[[1]]$my.table
1
6
[[1]]$my.binomial
[1] 1
[[2]]
[[2]]$my.table
10 20
3 3
[[2]]$my.binomial
integer(0)
[[3]]
[[3]]$my.table
1 2
3 3
[[3]]$my.binomial
[1] 1
If you want to keep the list names, you could use
sapply(names(my.list), table_function, simplify = FALSE, USE.NAMES = TRUE)
instead of lapply.
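Two caveats with the approach above: grep returns integer(0) when nothing matches, which is why the second element shows integer(0) rather than 0, and get() relies on the data frames also existing as objects outside the list. A minimal sketch that avoids both, assuming a hypothetical function name table_function2 and a cut-down version of the list:

```r
# Cut-down stand-in for my.list from the question
my.list <- list(matrix.apple1  = data.frame(X3 = c(1, 1)),
                matrix.orange1 = data.frame(X3 = c(10, 20)))

# table_function2 is a hypothetical name; it receives the element name
# and looks the data up in the list instead of calling get()
table_function2 <- function(name, lst) {
  obj <- lst[[name]]
  list(my.table    = table(as.matrix(obj)),
       # grepl() always returns TRUE or FALSE, never a zero-length result
       my.binomial = as.integer(grepl("apple", name, fixed = TRUE)))
}

out <- sapply(names(my.list), table_function2, lst = my.list,
              simplify = FALSE, USE.NAMES = TRUE)
print(out$matrix.orange1$my.binomial)
```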
Use Map and pass both the list data and its names to the function. Change your function to accept two arguments.
table.function <- function(data, name) {
# The three object names are:
# 'matrix.apple1', 'matrix.orange1' and 'matrix.apple2'
print(name)
# perform a trivial example operation on a data.frame
my.table <- table(as.matrix(data))
# Test whether an object name contains the string 'apple'
contains.apple <- grep('apple', name, fixed = TRUE)
# Use the result of the above test to perform a trivial example operation.
# With my code 'my.binomial' is always given the value of 0 even though
# 'apple' appears in the name of two of the data.frames.
my.binomial <- as.integer(contains.apple == 1)
return(list(my.table = my.table, my.binomial = my.binomial))
}
Map(table.function, my.list, names(my.list))
#[1] "matrix.apple1"
#[1] "matrix.orange1"
#[1] "matrix.apple2"
#$matrix.apple1
#$matrix.apple1$my.table
#1
#6
#$matrix.apple1$my.binomial
#[1] 1
#$matrix.orange1
#$matrix.orange1$my.table
#10 20
# 3 3
#$matrix.orange1$my.binomial
#integer(0)
#...
#...
The same functionality is provided by imap in purrr where you don't need to explicitly pass the names.
purrr::imap(my.list, table.function)
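For completeness, a small sketch of why deparse(substitute(x)) printed "x" in the question: substitute() reports the expression written at the call site, and inside the anonymous function passed to lapply that expression is literally x. The names show_name and apple below are made up for illustration.

```r
# show_name is a hypothetical helper that reports how its argument was written
show_name <- function(x) deparse(substitute(x))

apple <- 1

# Called directly, the argument expression is the symbol apple
direct <- show_name(apple)

# Called through a wrapper, the argument expression is the wrapper's own x
wrapped <- lapply(list(apple = apple), function(x) show_name(x))

print(direct)
print(wrapped[[1]])
```

This is why passing the names explicitly, as in the answers above, is the more dependable route.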
I'm working with multiple big data frames in R and I'm trying to write functions that can modify each of them (given a set of common parameters). One function is giving me trouble (shown below).
RawData <- function(x)
{
for(i in 1:nrow(x))
{
if(grep(".DERIVED", x[i,]) >= 1)
{
x <- x[-i,]
}
}
for(i in 1:ncol(x))
{
if(is.numeric(x[,i]) != TRUE)
{
x <- x[,-i]
}
}
return(x)
}
The objective of this function is twofold: first, to remove any rows that contain a ".DERIVED" string in any one of their cells (using grep), and second, to remove any columns that are non-numeric (using is.numeric). I get an error on the following condition:
if(grep(".DERIVED", x[i,]) >= 1)
The error states the "argument is of zero length", which I believe is usually associated with NULL values in a vector. However, I've used is.null on the entire data frame that is giving me errors, and it confirmed that there are no null values in the DF. I'm sure I'm missing something relatively simple here. Any advice would be greatly appreciated.
If you can use non-base-R functions, this should address your issue. df is the data.frame in question here. It will also be faster than looping over rows (generally not advised if avoidable).
library(dplyr)
library(stringr)
df %>%
filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%
select_if(is.numeric)
You can make it a function just as you would anything else:
mattsFunction <- function(dat){
dat %>%
filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%
select_if(is.numeric)
}
You should probably give it a better name, though.
The error is from the line
if(grep(".DERIVED", x[i,]) >= 1)
When grep doesn't find the term ".DERIVED", it returns something of zero length (integer(0)), so your inequality doesn't return TRUE or FALSE but logical(0). The error is telling you that the if statement cannot evaluate a zero-length condition.
A simple example:
if(grep(".DERIVED", "1234.DERIVEDabcdefg") >= 1) {print("it works")} # Works nicely, since the inequality can be evaluated
if(grep(".DERIVED", "1234abcdefg") >= 1) {print("no dice")} # Error: argument is of length zero
You can replace that line with if(length(grep(".DERIVED", x[i,])) != 0)
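A short sketch of the difference, using the same two strings as above; fixed = TRUE is added here as an assumption so that the dot is matched literally. grepl is shown as an alternative because it always returns TRUE or FALSE for each input.

```r
hit  <- grep(".DERIVED", "1234.DERIVEDabcdefg", fixed = TRUE)
miss <- grep(".DERIVED", "1234abcdefg", fixed = TRUE)

# grep() returns matching positions, so a miss has length zero
print(length(hit))
print(length(miss))

# grepl() returns one logical per input, which is safe inside if()
found_hit  <- grepl(".DERIVED", "1234.DERIVEDabcdefg", fixed = TRUE)
found_miss <- grepl(".DERIVED", "1234abcdefg", fixed = TRUE)

if (!found_miss) print("no match, and no error")
```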
There's something else you haven't noticed yet, which is that you're removing rows/columns inside a loop. Say you remove the 5th column; the next loop iteration (when i = 6) will be handling what was the 7th column! (This will end in an error along the lines of Error in `[.data.frame`(x, , i) : undefined columns selected.)
I prefer using dplyr, but if you need to use base R functions there are ways to do this without if statements.
Notice that you should consider using the regex "\\.DERIVED" rather than ".DERIVED", which means "any character followed by DERIVED".
I don't have example data or output, so here's my best go...
# Made up data
test <- data.frame(a = c("data","data.DERIVED","data","data","data.DERIVED"),
b = (c(1,2,3,4,5)),
c = c("A","B","C","D","E"),
d = c(2,5,6,8,9),
stringsAsFactors = FALSE)
# Note: The following code assumes that the column class is numeric because the
# example code provided assumed that the column class was numeric. It will not
# detect a character column whose values happen to consist only of digits.
# Using the base subset command
test2 <- subset(test,
subset = !grepl("\\.DERIVED",test$a),
select = sapply(test,is.numeric))
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
# Trying to use []. Note: If only 1 column is numeric this will return a vector
# instead of a data.frame
test2 <- test[!grepl("\\.DERIVED",test$a),]
test2 <- test2[,sapply(test,is.numeric)]
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
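The code above only looks at column a. If ".DERIVED" can appear in any cell, as the question describes, a base R sketch along the following lines might do; the function name raw_data is an assumption, and fixed = TRUE is used so that no regex escaping is needed. It filters with logical vectors instead of deleting inside a loop.

```r
# raw_data is a hypothetical replacement for the RawData function in the question
raw_data <- function(x) {
  # One TRUE/FALSE per row: does any cell in the row contain ".DERIVED"?
  has_derived <- apply(as.matrix(x), 1, function(r) {
    any(grepl(".DERIVED", r, fixed = TRUE))
  })
  x <- x[!has_derived, , drop = FALSE]
  # Keep numeric columns only; drop = FALSE keeps a data.frame even for one column
  x[, vapply(x, is.numeric, logical(1)), drop = FALSE]
}

test <- data.frame(a = c("data", "data.DERIVED", "data", "data", "data.DERIVED"),
                   b = c(1, 2, 3, 4, 5),
                   c = c("A", "B", "C", "D", "E"),
                   d = c(2, 5, 6, 8, 9),
                   stringsAsFactors = FALSE)
res <- raw_data(test)
print(res)
```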
I'm trying to call a vector "a" from a data frame "df" using a function. I know I could do this just fine with the following:
> df$a
[1] 1 2 3
But I'd like to use a function where both the data frame and vector names are input separately as arguments. This is the best that I've come up with:
show_vector <- function(data.set, column) {
data.set$column
}
But here's how it goes when I try it out:
> show_vector(df, a)
NULL
How could I change this function in order to successfully reference vector df$a where the names of both are input to a function as arguments?
It's actually possible to do this without passing the column name as a string (in other words, you can pass in the unquoted column name):
show_vector <- function(data.set, column) {
eval(substitute(column), envir = data.set)
}
Usage example:
df <- data.frame(a = 1:3, b = 4:6)
show_vector(df, b)
# 4 5 6
I've wondered about this kind of thing a lot in the past and haven't found an easy fix. The best I've come up with is this:
df <- data.frame(c(1, 2, 3), c(4, 5, 6))
colnames(df) <- c("A", "B")
test <- function(dataframe, columnName) {
return(dataframe[, match(columnName, colnames(dataframe))])
}
test(df, "A")
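Your code would work if you indexed with [ or [[ instead of $ and put the column name in quotes, i.e. show_vector(df, "a"); data.set$column always looks for a column literally named column.
There are several other ways to do this: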
Using base functionality
func <- function(df, cname){
return(df[, grep(cname, colnames(df))])
}
Or even
func <- function(df, cname){
return(df[, cname])
}
You can use substitute to capture the input name as it is, then use as.character to turn it into a character string.
show_vector <- function(data.set, column) {
data.set[,as.character(substitute(column))]
}
Now let's take a look:
(dat=data.frame(a=1:3,b=4:6,c=10:12))
a b c
1 1 4 10
2 2 5 11
3 3 6 12
show_vector(dat,a)
[1] 1 2 3
show_vector(dat,"a")
[1] 1 2 3
It works.
We can also write a simpler one where we just pass a character string:
show_vector1 <- function(data.set, column) {
data.set[,column]
}
show_vector1(dat,"a")
[1] 1 2 3
This will not work, however, if the column name is not given as a character string:
show_vector1(dat,a)
Error in `[.data.frame`(data.set, , column) : undefined columns selected
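To tie the answers above together, a minimal sketch of why the original data.set$column returned NULL and of the [[ alternative. The $ operator does not evaluate what follows it, so it searches for a column literally called column; [[ evaluates its argument and uses the string it holds.

```r
df <- data.frame(a = 1:3, b = 4:6)

# $ takes its right-hand side literally: there is no column named "column"
literal <- df$column

# [[ evaluates its argument, so a string held in a variable works
show_vector <- function(data.set, column) {
  data.set[[column]]
}

res <- show_vector(df, "a")
print(res)
```

Unlike the grep-based version earlier, [[ on a data frame needs the exact column name, so "a" will not accidentally pick up a column called "ab".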
It is of course possible to store functions in a list to call it.
It is also possible to name that list entry to have a better access to it later.
Now I need the list item name to be a regular expression like this:
funcList <- list("^\\+[0-9]{1,3}$"=lead, "^\\-[0-9]{1,3}$"=lag)
a <- funcList$"+12"(a,12) # this will fire function "lead"
a <- funcList$"-4"(a,-4) # this will fire function "lag"
a <- funcList$"^\\+[0-9]{1,3}$"(a,12) # this works of course but is not what I want...
Of course this is not working correctly and I am getting the error "Error: attempt to apply non-function" because it is not used as regex but as a normal string value.
Is it possible to do what I need?
You could use the names of the list as patterns for grepl:
funcList <- list("^\\+[0-9]{1,3}$"=lead, "^\\-[0-9]{1,3}$"=lag)
f1 <- funcList[sapply(names(funcList), function(x) grepl(x,"+12"))][[1]]
f2 <- funcList[sapply(names(funcList), function(x) grepl(x,"-4"))][[1]]
> f1(seq(1,10))
[1] 2 3 4 5 6 7 8 9 10 NA
> f2(seq(1,10))
[1] NA 1 2 3 4 5 6 7 8 9
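The lookup above can be wrapped into a small helper. This is only a sketch: dispatch, shift_lead and shift_lag are made-up names, and the two shift functions are simplified base R stand-ins for lead and lag (assuming 0 <= n <= length(x)) so that the example runs without extra packages.

```r
# Simplified stand-ins for lead()/lag(); hypothetical, for illustration only
shift_lead <- function(x, n) c(x[seq_along(x) > n], rep(NA, n))
shift_lag  <- function(x, n) c(rep(NA, n), head(x, length(x) - n))

# dispatch is a hypothetical helper: it returns the first function whose
# list name, read as a regular expression, matches key
dispatch <- function(key, funcList) {
  hit <- vapply(names(funcList), function(pattern) grepl(pattern, key),
                logical(1))
  if (!any(hit)) stop("no pattern matches '", key, "'")
  funcList[[which(hit)[1]]]
}

funcList <- list("^\\+[0-9]{1,3}$" = shift_lead,
                 "^-[0-9]{1,3}$"   = shift_lag)

lead_res <- dispatch("+2", funcList)(1:5, 2)
lag_res  <- dispatch("-1", funcList)(1:3, 1)
print(lead_res)
print(lag_res)
```

Stopping with an error when nothing matches avoids the less helpful "subscript out of bounds" that [[1]] on an empty list would give.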
I think you can map strings like "+4" and "-12" to lead/lag more straightforwardly like:
library(dplyr)  # provides lead(), lag(), mutate() and %>%
set.seed(123)
df = data.frame(
x = sample(1:20, 10)
)
shifted = function(x, shift) {
direction = substr(shift, 1, 1)
amount = as.integer(substr(shift, 2, nchar(shift)))
if (direction == "+") {
return(lead(x, amount))
} else {
return(lag(x, amount))
}
}
df %>%
mutate(
plus4 = shifted(x, "+4"),
minus3 = shifted(x, "-3")
)
You could use regex within the shifted function if you need to do more validation of the "+4" strings, but I prefer not to go for complicated regexes unless they're definitely needed.
In my code, I am filling the columns of a dataframe with vectors, as so:
df1[columnNum] <- barWidth
This works fine, except for one thing: I want the name of the vector variable (barWidth above) to be retained as the column header, one column at a time. Furthermore, I do not wish to use cbind. This slows the execution of my code down considerably. Consequently, I am using a pre-allocated dataframe.
Can this be done in the vector-to-column assignment? If not, then how do I change it after the fact? I can't find the right syntax to do this with colNames().
TIA
It's being done by the [<-.data.frame function. It could conceivably be replaced by one that looked at the name of the argument but it's such a fundamental function I would be hesitant. Furthermore there appears to be an aversion to that practice signaled by this code at the top of the function definition:
> `[<-.data.frame`
function (x, i, j, value)
{
if (!all(names(sys.call()) %in% c("", "value")))
warning("named arguments are discouraged")
nA <- nargs()
if (nA == 4L) {
<snipped rest of rather long definition>
I don't know why that is there, but it is. Maybe you should either be thinking about using names<- after the column assignment, or using this method:
> dfrm["barWidth"] <- barWidth
> dfrm
a V2 barWidth
1 a 1 1
2 b 2 2
3 c 3 3
4 d 4 4
This can be generalized to a list of new columns:
dfrm <- data.frame(a=letters[1:4])
barWidth <- 1:4
newcols <- list(barWidth=barWidth, bw2 =barWidth)
dfrm[names(newcols)] <- newcols
dfrm
#
a barWidth bw2
1 a 1 1
2 b 2 2
3 c 3 3
4 d 4 4
If you have the list of names of vectors you want to apply you could do:
namevec <- c(...,"barWidth"...,)
columnNums <- c(...,10,...)
df1[columnNums[i]] <- get(namevec[i])
names(df1)[columnNums[i]] <- namevec[i]
or even
columnNums <- c(barWidth=4,...)
for (i in seq_along(columnNums)) {
df1[columnNums[i]] <- get(names(columnNums)[i])
}
names(df1)[columnNums] <- names(columnNums)
but the deeper question would be where this set of vectors is coming from in the first place: could you have them in a list all along?
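If the vectors really do have to stay as separate variables, one possible sketch is a small helper that records the name of the variable it was given and applies it as the column header. The name set_col is hypothetical, and note that the function returns a modified copy of the data frame, so whether it is fast enough for the pre-allocated use case would need to be measured.

```r
# set_col is a hypothetical helper: fill column number idx of df with vec
# and name that column after the variable that was passed in
set_col <- function(df, idx, vec) {
  col_name <- deparse(substitute(vec))  # e.g. "barWidth"
  df[[idx]] <- vec
  names(df)[idx] <- col_name
  df
}

# Pre-allocated data frame with placeholder column names
df1 <- data.frame(V1 = rep(NA_real_, 4), V2 = rep(NA_real_, 4))
barWidth <- c(1, 2, 3, 4)

df1 <- set_col(df1, 2, barWidth)
print(names(df1))
```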
I'd simply use cbind():
df1 <- cbind( df1, barWidth )
which retains the name. It will, however, end up as the last column in df1.