I have a list of data.frames. I want to send each data.frame to a function using lapply. Inside the function I want to check whether the name of a data.frame includes a particular string. If the string in question is present I want to perform one series of operations. Otherwise I want to perform a different series of operations. I cannot figure out how to check whether the string in question is present from within the function.
I wish to use base R. This seems to be a possible solution but I cannot get it to work:
In R, how to get an object's name after it is sent to a function?
Here is an example list followed by an example function further below.
matrix.apple1 <- read.table(text = '
X3 X4 X5
1 1 1
1 1 1
', header = TRUE)
matrix.apple2 <- read.table(text = '
X3 X4 X5
1 1 1
2 2 2
', header = TRUE)
matrix.orange1 <- read.table(text = '
X3 X4 X5
10 10 10
20 20 20
', header = TRUE)
my.list <- list(matrix.apple1 = matrix.apple1,
matrix.orange1 = matrix.orange1,
matrix.apple2 = matrix.apple2)
This operation can check whether each object name contains the string apple,
but I am not sure how to use this information inside the function further below.
grepl('apple', names(my.list), fixed = TRUE)
#[1] TRUE FALSE TRUE
Here is an example function. Based on hours of searching and trial-and-error I perhaps am supposed to use deparse(substitute(x)) but so far it only returns x or something similar.
table.function <- function(x) {
# The three object names are:
# 'matrix.apple1', 'matrix.orange1' and 'matrix.apple2'
myObjectName <- deparse(substitute(x))
print(myObjectName)
# perform a trivial example operation on a data.frame
my.table <- table(as.matrix(x))
# Test whether an object name contains the string 'apple'
contains.apple <- grep('apple', myObjectName, fixed = TRUE)
# Use the result of the above test to perform a trivial example operation.
# With my code 'my.binomial' is always given the value of 0 even though
# 'apple' appears in the name of two of the data.frames.
my.binomial <- ifelse(contains.apple == 1, 1, 0)
return(list(my.table = my.table, my.binomial = my.binomial))
}
table.function.output <- lapply(my.list, function(x) table.function(x))
These are the results of print(myObjectName):
#[1] "x"
#[1] "x"
#[1] "x"
table.function.output
Here are the rest of the results of table.function, showing that my.binomial is never 1 (it is logical(0) for every data.frame).
The first and third value of my.binomial should be 1 because the names of the first and third data.frames contain the string apple.
# $matrix.apple1
# $matrix.apple1$my.table
# 1
# 6
# $matrix.apple1$my.binomial
# logical(0)
#
# $matrix.orange1
# $matrix.orange1$my.table
# 10 20
# 3 3
# $matrix.orange1$my.binomial
# logical(0)
#
# $matrix.apple2
# $matrix.apple2$my.table
# 1 2
# 3 3
# $matrix.apple2$my.binomial
# logical(0)
You could redesign your function to use the list names instead:
table_function <- function(myObjectName) {
# The three object names are:
# 'matrix.apple1', 'matrix.orange1' and 'matrix.apple2'
myObject <- get(myObjectName)
print(myObjectName)
# perform a trivial example operation on a data.frame
my.table <- table(as.matrix(myObject))
# Test whether an object name contains the string 'apple'
contains.apple <- grep('apple', myObjectName, fixed = TRUE)
# Use the result of the above test to perform a trivial example operation.
# grep() returns 1 on a match and integer(0) otherwise, so 'my.binomial'
# is 1 for the 'apple' names and integer(0) (not 0) for 'matrix.orange1'.
my.binomial <- +(contains.apple == 1)
return(list(my.table = my.table, my.binomial = my.binomial))
}
lapply(names(my.list), table_function)
This returns
[[1]]
[[1]]$my.table
1
6
[[1]]$my.binomial
[1] 1
[[2]]
[[2]]$my.table
10 20
3 3
[[2]]$my.binomial
integer(0)
[[3]]
[[3]]$my.table
1 2
3 3
[[3]]$my.binomial
[1] 1
If you want to keep the list names, you could use
sapply(names(my.list), table_function, simplify = FALSE, USE.NAMES = TRUE)
instead of lapply.
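As a variation on the name-based approach above, here is a minimal sketch that looks the data.frame up in the list with [[ rather than with get(), and uses grepl() so that a non-matching name yields 0 instead of integer(0). The helper name table_by_name and the small stand-in list are assumptions made for illustration only.

```r
# Small stand-in for the list in the question (values are illustrative).
my.list <- list(matrix.apple1  = data.frame(X3 = c(1, 1),   X4 = c(1, 1)),
                matrix.orange1 = data.frame(X3 = c(10, 20), X4 = c(10, 20)),
                matrix.apple2  = data.frame(X3 = c(1, 2),   X4 = c(1, 2)))

# Hypothetical helper: receives a name and the list that name belongs to.
table_by_name <- function(name, data.list) {
  x <- data.list[[name]]                      # look up by name, no get() needed
  my.table <- table(as.matrix(x))
  # grepl() always returns TRUE/FALSE, so this is 1 or 0, never integer(0)
  my.binomial <- as.integer(grepl("apple", name, fixed = TRUE))
  list(my.table = my.table, my.binomial = my.binomial)
}

# Iterate over the names; setNames() keeps the names on the result.
out <- lapply(setNames(names(my.list), names(my.list)),
              table_by_name, data.list = my.list)
sapply(out, `[[`, "my.binomial")
```

Because the list is passed in explicitly, this does not depend on the data.frames also existing as separate objects in the global environment.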
Use Map and pass both the list data and its name to the function. Change your function to accept two arguments.
table.function <- function(data, name) {
# The three object names are:
# 'matrix.apple1', 'matrix.orange1' and 'matrix.apple2'
print(name)
# perform a trivial example operation on a data.frame
my.table <- table(as.matrix(data))
# Test whether an object name contains the string 'apple'
contains.apple <- grep('apple', name, fixed = TRUE)
# Use the result of the above test to perform a trivial example operation.
# grep() returns 1 on a match and integer(0) otherwise, so 'my.binomial'
# is 1 for the 'apple' names and integer(0) (not 0) for 'matrix.orange1'.
my.binomial <- as.integer(contains.apple == 1)
return(list(my.table = my.table, my.binomial = my.binomial))
}
Map(table.function, my.list, names(my.list))
#[1] "matrix.apple1"
#[1] "matrix.orange1"
#[1] "matrix.apple2"
#$matrix.apple1
#$matrix.apple1$my.table
#1
#6
#$matrix.apple1$my.binomial
#[1] 1
#$matrix.orange1
#$matrix.orange1$my.table
#10 20
# 3 3
#$matrix.orange1$my.binomial
#integer(0)
#...
#...
The same functionality is provided by imap in purrr where you don't need to explicitly pass the names.
purrr::imap(my.list, table.function)
HEADLINE: Is there a way to get R to recognize data.frame column names contained within lists in the same way that it can recognize free-floating vectors?
SETUP: Say I have a vector named varA:
(varA <- 1:6)
# [1] 1 2 3 4 5 6
To get the length of varA, I could do:
length(varA)
#[1] 6
and if the variable was contained within a larger list, the variable and its length could still be found by doing:
list <- list(vars = "varA")
length(get(list$vars[1]))
#[1] 6
PROBLEM:
This is not the case when I substitute the vector for a dataframe column and I don't know how to work around this:
rows <- 1:6
cols <- c("colA")
(df <- data.frame(matrix(NA,
nrow = length(rows),
ncol = length(cols),
dimnames = list(rows, cols))))
# colA
# 1 NA
# 2 NA
# 3 NA
# 4 NA
# 5 NA
# 6 NA
list <- list(vars = "varA",
cols = "df$colA")
length(get(list$vars[1]))
#[1] 6
length(get(list$cols[1]))
#Error in get(list$cols[1]) : object 'df$colA' not found
Though this contrived example seems inane, because I could always use the simple length(variable) approach, I'm actually interested in writing data from hundreds of variables varying in lengths onto respective dataframe columns, and so keeping them in a list that I could iterate through would be very helpful. I've tried everything I could think of, but it may be the case that it's just not possible in R, especially given that I cannot find any posts with solutions to the issue.
You could try:
> length(eval(parse(text = list$cols[1])))
[1] 6
Or:
list <- list(vars = "varA",
cols = "colA")
length(df[, list$cols[1]])
[1] 6
Or with regex:
list <- list(vars = "varA",
cols = "df$colA")
length(df[, sub(".*\\$", "", list$cols[1])])
[1] 6
If you are truly working with a data frame d, then nrow(d) is the length of all of the variables in d. There should be no reason to use length in this case.
If you are actually working with a list x containing variables of potentially different lengths, then you should use the [[ operator to extract those variables by name (see ?Extract):
x <- list(a = 1:10, b = rnorm(20L))
l <- list(vars = "a")
length(x[[l$vars[1L]]]) # 10
If you insist on using get (you shouldn't), then you need to supply a second argument telling it where to look for the variable (see ?get):
length(get(l$vars[1L], x)) # 10
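For the underlying goal mentioned in the question (writing many variables of different lengths into data frame columns), one possible sketch is to keep the variables in a named list and pad each one with NA to a common length. The list vars and its contents are made up for illustration.

```r
# Hypothetical variables of different lengths, kept in a named list.
vars <- list(colA = 1:6,
             colB = c(2.5, 3.5),
             colC = c("u", "v", "w", "x"))

# Longest variable determines the number of rows.
n <- max(lengths(vars))

# Assigning a larger length pads a vector with NA.
padded <- lapply(vars, function(v) {
  length(v) <- n
  v
})

df <- as.data.frame(padded, stringsAsFactors = FALSE)

# Original (unpadded) lengths are still available from the list by name.
lengths(vars)[["colB"]]
```

Extracting by name with [[ works the same way on the list and on the resulting data frame, so no get() or eval(parse()) is needed.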
I am creating a function that takes a list of user-specified words and then labels each word with a number according to its position in the list. The user can specify different list lengths.
For example:
myNotableWords<-c("No_IM","IM","LGD","HGD","T1a")
aa<-c("No_IM","IM","No_IM","HGD","T1a","HGD","T1a","IM","LGD")
aa<-data.frame(aa,stringsAsFactors=FALSE)
Intended Output
new <- c(1,2,1,4,5,4,5,2,3)
Is there a way of getting the index of the original list, looking up where each element of the target list is in that index, and replacing it with the index number?
Why not just use the factor functionality of R?
A "factor data type" stores an integer that references a "level" (= character string) via the index number:
myNotableWords<-c("No_IM","IM","LGD","HGD","T1a")
aa<-c("No_IM","IM","No_IM","HGD","T1a","HGD","T1a","IM","LGD")
aa <- as.integer(factor(aa, myNotableWords, ordered = TRUE))
aa
# [1] 1 2 1 4 5 4 5 2 3
new <- c()
for (item in aa) {
new <- c(new, which(myNotableWords == item))
}
print(new)
#[1] 1 2 1 4 5 4 5 2 3
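The loop above can also be written as a single vectorized call with base R's match(), which returns the position of each element of its first argument in its second. A minimal sketch using the vectors from the question:

```r
myNotableWords <- c("No_IM", "IM", "LGD", "HGD", "T1a")
aa <- c("No_IM", "IM", "No_IM", "HGD", "T1a", "HGD", "T1a", "IM", "LGD")

# Position of each element of aa within myNotableWords
new <- match(aa, myNotableWords)
new
# [1] 1 2 1 4 5 4 5 2 3

# Words that are not in the lookup vector come back as NA
match("unknown", myNotableWords)
# [1] NA
```

Unlike the loop, match() keeps the result the same length as the input even when a word is missing from the lookup vector.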
You can do this using data.frame; the syntax shouldn't change. I prefer using data.table though.
library(data.table)
myWords <- c("No_IM","IM","LGD","HGD","T1a")
myIndex <- data.table(keywords = myWords, word_index = seq(1, length(myWords)))
The third line simply adds an index to the vector myWords.
aa <- data.table(keywords = c("No_IM","IM","No_IM","HGD","T1a",
"HGD","T1a","IM","LGD"))
aa <- merge(aa, myIndex, by = "keywords", all.x = TRUE)
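And now you have a table that shows each keyword and its index number. Note that merge sorts the result by the key by default, so the original row order of aa is not preserved; pass sort = FALSE to keep it.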
I want to create a function in R that will create a numerical column based on a character/categorical column. In order to do this I need to get the distinct values in the categorical column. I can do this outside a function well, but would like to make a reusable function to do it. The issue I've run into is that the same distinct() call that works outside the function doesn't behave the same way within the function. I've created a demo below:
# test of call to db to numericize
DF <- data.frame("a" = c("a","b","c","a","b","c"),
"b" = paste(0:5, ".1", sep = ""),
"c" = letters[1:6],
stringsAsFactors = FALSE)
catnum <- function(db, inputcolname) {
x <- distinct(db,inputcolname);
print(x);
return(x);
}
y <- distinct(DF,a)
y
catnum(DF,'a')
While y gives the correct distinct one column answer (one column with (a,b,c) in it), x within the function is the entire dataframe. I have tried with and without the ' ', as in catnum(DF,a) but the results are the same.
Could someone tell me what is happening or suggest some code that would work?
One solution is to use the distinct_ function inside your function. distinct expects a bare (unquoted) column name and does not work with a column name stored in a variable.
For example, distinct(DF, "a") will not work; the actual syntax is distinct(DF, a), without quotes. When distinct is called inside the function, the column name is supplied through the variable inputcolname, which is evaluated rather than treated as a column name, hence the unexpected result. distinct_ does accept a column name held in a variable. (Note that distinct_ has been deprecated since dplyr 0.7.0.)
library(dplyr)
catnum <- function(db, inputcolname) {
x <- distinct_(db,inputcolname);
#print(x);
return(x);
}
#With modified function results were as expected.
catnum(DF,'a')
# a
# 1 a
# 2 b
# 3 c
Not sure what you are trying to do or where the distinct function is coming from. Are you looking for this?
catnum<-function(DF,var){
length(unique(DF[[var]]))
}
catnum(DF,'a')
Your inputs are not the same, and so you get different results. If you give distinct the same arguments you give catnum, you will get the same result:
isTRUE(all.equal(distinct(DF, a),
catnum(DF, "a")))
## [1] FALSE
isTRUE(all.equal(distinct(DF, "a"),
catnum(DF, "a")))
##[1] TRUE
Unfortunately, this does not work:
catnum(DF, a)
## a b c
## 1 a 0.1 a
## 2 b 1.1 b
## 3 c 2.1 c
## 4 a 3.1 d
## 5 b 4.1 e
## 6 c 5.1 f
The reason, as explained in
vignette("programming")
is that you must jump through several annoying hoops if you want to write functions that use functions from dplyr. The solution (as you will learn in the vignette) is as follows:
catnum <- function(db, inputcolname) {
inputcolname <- enquo(inputcolname)
distinct(db, !!inputcolname)
}
catnum(DF, a)
## a
## 1 a
## 2 b
## 3 c
Or you could conclude that this is all too confusing and do something like
catnum <- function(db, inputcolname) {
unique(db[, inputcolname, drop = FALSE])
}
catnum(DF, "a")
## a
## 1 a
## 2 b
## 3 c
instead.
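Going one step further toward the stated goal (a numerical column derived from a categorical one), here is a base R sketch that needs no dplyr at all. The helper name add_numeric_code and the "_num" suffix are assumptions for illustration; codes are assigned in order of first appearance.

```r
DF <- data.frame(a = c("a", "b", "c", "a", "b", "c"),
                 b = paste0(0:5, ".1"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: adds a column <inputcolname>_num with integer codes.
# inputcolname is a character string, so [[ can be used directly.
add_numeric_code <- function(db, inputcolname) {
  values <- db[[inputcolname]]
  # unique() gives the distinct values; match() maps each value to its position
  db[[paste0(inputcolname, "_num")]] <- match(values, unique(values))
  db
}

res <- add_numeric_code(DF, "a")
res$a_num
# [1] 1 2 3 1 2 3
```

Passing the column name as a string sidesteps the quoting issue entirely, at the cost of not supporting the bare-name style catnum(DF, a).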
I am having a problem with get() in R.
I have a set of data.frames with a common structure in my environment. I want to loop through these data frames and change the name of the 2nd column so that the name of the 2nd column contains a prefix from the 1st column.
For example, if column 1 = A_cat and column 2 is dog, I want column 2 to be changed to A_dog.
Below is an example of the R code I am using:
df <- data.frame('A_cat'= 1:10 , 'dog' = 11:20)
for( element in grep('^df$', names(environment()), value=TRUE) ) {
colnames(get(element))[2] <- paste(strsplit(colnames(get(element)) [1], '_')[[1]][1],
colnames(get(element))[2], sep='_')
}
The arguments within the for loop, on either side of the assignment operator, both give the expected result if I run them separately but when run together produce the following error.
Error in colnames(get(element))[2] <- paste(strsplit(colnames(get(element))[1], :
could not find function "get<-"
Any help with this problem would be greatly appreciated.
This does the same thing as the code in the question without using get:
df <- data.frame('A_cat'= 1:10 , 'dog' = 11:20)
e <- environment() ##
df.names <- grep("^df$", names(e), value = TRUE)
# nm is the current data frame name and nms are its column names
for(nm in df.names) {
nms <- names(e[[nm]])
names(e[[nm]])[2] <- paste0(sub("_.*", "_", nms[1]), nms[2])
}
giving:
> df
A_cat A_dog
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
Keeping the data.frames in a named list as suggested in a comment to the question might be even better. For example, if instead of keeping the data.frames in an environment they were in a list called e
e <- list(df = df)
then omit the line marked ## and the rest works as is.
Here would be one way to accomplish this goal if the data.frames have systematic names (here, df1 df2 df3, etc) and the prefix ends with "_" as in the example:
# suggested by #roland roll them up in a list:
myDfList <- mget(ls(pattern="^df"))
# change names
for(dfName in names(myDfList)) {
names(myDfList[[dfName]])[2] <- paste0(sub("^([^_]*_).*$", "\\1",
names(myDfList[[dfName]])[1]),
names(myDfList[[dfName]])[2])
}
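As background on the error in the question: an assignment of the form colnames(get(element))[2] <- value makes R look for a replacement function named `get<-`, which does not exist. If you would rather stay close to the original loop, one possible sketch is to copy the object out with get(), modify the copy, and write it back with assign(). The variable names env, tmp and prefix are illustrative.

```r
df <- data.frame(A_cat = 1:10, dog = 11:20)

env <- environment()  # the environment holding the data.frames

for (nm in grep("^df$", ls(env), value = TRUE)) {
  tmp <- get(nm, envir = env)                     # copy the data.frame out
  prefix <- sub("_.*", "_", names(tmp)[1])        # "A_cat" -> "A_"
  names(tmp)[2] <- paste0(prefix, names(tmp)[2])  # "dog"   -> "A_dog"
  assign(nm, tmp, envir = env)                    # write the modified copy back
}

names(df)
# [1] "A_cat" "A_dog"
```

This works, but keeping the data.frames in a named list, as the answers above suggest, avoids get() and assign() altogether.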
I have some R command like this
subset(
(aggregate(cbind(var1,var2)~Ei+Mi+hours,a, FUN=mean)),
(aggregate(cbind(var1,var2)~Ei+Mi+hours,a, FUN=mean))$Ei == c(1:EXP)
)
I want to do
1) Ask the user to input the var1 and var2
2) Get those variables into the subset command line as shown above and
continue with other things.
Note: for reading the user input I have variables like
c(ax,bx,cx,dx,ex,fx,gx,hx,ix,jx,kx,lx,mx,nx,ox) = c(1:15) and each
variable is mapped to number 1 to 15. So displaying this for user and
asking the user to select any number between 1 to 15 and then
checking the corresponding variable for the entered number and
reading this into the command line is whats the best method, I think.
So how can I implement this?
Regarding the answer:
Just wondering about one possible scenario: what if the user wants to enter multiple numbers in one go (e.g. 1,2,3)? How can these be read using readline, as in the answer below?
v1 <- quote(var1 <- as.numeric(readline('Enter Variable 1: ')))
eval(v1)
xx <- paste0(letters[1:15], 'x')
xx[var1]
How to read multiple variables in this case?
Here's a rough example of the readline interactive prompt. When v1 is evaluated, the user will be prompted to enter a value. That value is then stored as var1.
> v1 <- quote(var1 <- as.numeric(readline('Enter Variable 1: ')))
> eval(v1)
Enter Variable 1: 1000 ## user enters 1000, for example
> 100 + var1 + 50 ## example to show captured output as object
## [1] 1150
So in your case it might go something like
> v1 <- quote(var1 <- as.numeric(readline('Enter a number from 1 to 15: ')))
> eval(v1)
Enter a number from 1 to 15: 7
> var1
## [1] 7
> xx <- paste0(letters[1:15], 'x')
> xx
## [1] "ax" "bx" "cx" "dx" "ex" "fx" "gx" "hx" "ix" "jx" "kx" "lx" "mx" "nx" "ox"
> xx[var1]
## [1] "gx"
I borrowed this idea for a function from this older SO post. You can return the output invisibly and it will still take in the user values.
input.fun <- function(){
v1 <- readline("var1: ")
v2 <- readline("var2: ")
v3 <- readline("var3: ")
v4 <- readline("var4: ")
v5 <- readline("var5: ")
out <- sapply(c(v1, v2, v3, v4, v5), as.numeric, USE.NAMES = FALSE)
invisible(out)
}
> x <- input.fun()
var1: 7
var2: 4
var3: 8
var4: 5
var5: 2
> x
[1] 7 4 8 5 2
In response to your edit: I'm not sure if this is the standard method for reading multiple numbers in one line, but it works.
> xx <- readline('Enter numbers separated by a space: ')
Enter numbers separated by a space: 4 12 67 9 2
> as.numeric(strsplit(xx, ' ')[[1]])
## [1] 4 12 67 9 2
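Since the edit to the question uses commas (1,2,3), here is a small sketch that accepts commas, spaces, or both. It parses a string so it can be tried without an interactive session; in practice the string would come from readline(). The helper name parse_numbers is made up for illustration.

```r
# Hypothetical helper: split on commas and/or whitespace, then convert.
parse_numbers <- function(txt) {
  parts <- strsplit(trimws(txt), "[,[:space:]]+")[[1]]
  as.numeric(parts[nzchar(parts)])  # drop empty pieces before converting
}

parse_numbers("1,2,3")
# [1] 1 2 3
parse_numbers("4 12 67 9 2")
# [1]  4 12 67  9  2

# Map the entered numbers to the variable names from the question
xx <- paste0(letters[1:15], "x")
xx[parse_numbers("1, 2, 3")]
# [1] "ax" "bx" "cx"
```

Non-numeric input becomes NA with a warning from as.numeric(), so a check with anyNA() before indexing would be a sensible addition.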
Here's a possibility using scan()
#sample data
df<-data.frame(
ax=runif(50),
bx=runif(50),
cx=runif(50),
dx=runif(50),
Ei=sample(letters[1:5], 50, replace=T)
)
#get vars
vars<-c(NA,NA)
while(any(is.na(vars))) {
cat(paste("enter var number", sum(!is.na(vars))+1),"\n")
cat(paste(seq_along(names(df)), ":", names(df)), sep="\n")
try(n<-scan(what=integer(), nmax=1), silent=T)
vars[min(which(is.na(vars)))]<-n
}
#--pause
#use vars
subset(aggregate(df[,vars], df[,c("Ei"), drop=F], FUN=mean), Ei=="a")
It's not super robust, but if you copy the first half (before the pause) it will ask you for two variable numbers, and then if you run the second half, it will use those two values. I've adjusted the aggregate and subset to be more appropriate for variable usage which means not using the formula syntax.
I did not do any error checking. That's left as an exercise for the asker.
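As one possible starting point for that error checking, the aggregation step can be wrapped in a function that validates the chosen column numbers before using them. The helper name aggregate_selected and the small example data are assumptions for illustration.

```r
# Small deterministic stand-in for the sample data above.
df <- data.frame(ax = c(1, 3, 5, 7),
                 bx = c(2, 4, 6, 8),
                 cx = c(0, 0, 1, 1),
                 Ei = c("a", "a", "b", "b"))

# Hypothetical helper: aggregate the columns chosen by number, grouped by `by`.
aggregate_selected <- function(data, idx, by = "Ei") {
  stopifnot(is.numeric(idx),
            all(idx %in% seq_along(data)),   # valid column numbers only
            !(by %in% names(data)[idx]))     # grouping column not selected
  aggregate(data[, idx, drop = FALSE],
            data[, by, drop = FALSE],
            FUN = mean)
}

res <- aggregate_selected(df, c(1, 2))
res
#   Ei ax bx
# 1  a  2  3
# 2  b  6  7
```

An out-of-range selection such as aggregate_selected(df, 9) stops with an error instead of failing later inside aggregate().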