There are a few questions here; I would be satisfied if any one of them were answered sufficiently well.
Background - what is the end goal?
I am interested in representing a date range in R. The bare-minimum requirement is that we represent a start and end date, which can easily be done using a length-two date vector. Additionally, it would be nice to extend this object into a class which further
supplies a name to each range (i.e. a character string)
enables the (easy) use of the dplyr::between function
Shortcomings of my previous approach
I've previously represented each range as a length-two date vector. The upside here is that I don't rely on any external dependencies and my data structure is so lightweight that it's not a hassle to program with. The downside is that I'm tired of having to access the beg and end of the date range via the [ operator and arguments 1 and 2 respectively (arguably less interpretable than if we had a class implementation).
Also, we ultimately deal with a sequence of date-ranges (i.e. a vector), and so abstracting away the DateRange is helpful before we start nesting data structures. I do not want to use a list of length-two date vectors nor do I wish to use a data.frame with two rows, each column being interpreted as a date-range.
Where have I looked?
I've looked at the lubridate package and have considered inheriting from its Interval class. The downside to starting with this inheritance is that I don't think S4 is necessary for my use case. I just need a few simple data attributes and a nice API for calling dplyr::between.
An ideal solution might just extend the lubridate::Interval class to hold a name and an end date (the latter could be a method, as this info is already stored in Interval via @start + @.Data), and extend dplyr::between to play nicely with said class.
What have I tried?
Here's a rough implementation of what I'm looking for:
# 3 key attributes: beg, end, and name.
MyInterval <- function(beg, end, name = NULL) {
  # is.character() is safer than class(x) == "character" for multi-class objects
  if (is.character(beg)) beg <- as.Date(beg)
  if (is.character(end)) end <- as.Date(end)
  if (is.null(name)) name <- as.character(beg)
  structure(.Data = list('beg' = beg, 'end' = end, 'name' = name), class = "MyInterval")
}
Now, I would like to be able to overload the between operator such that I may call it as follows: between(x, MyInterval), where we notice that dplyr::between(x, lo, hi) expects three arguments. To try and accomplish this, I've tried to set up type dispatching as follows:
between <- function(...) UseMethod('between')
between.MyInterval <- function(interval, x) {
  if (is.character(x)) x <- as.Date(x)
  dplyr::between(x, interval$beg, interval$end)
}
between.default <- function(x, lo, hi) dplyr::between(x, lo, hi)
The reason I chose to use ... in the prototype for between is that the order of arguments currently differ between between.MyInterval and between.default. Is there a better way to code this up? I believe the behavior is as desired (to within a first glance)
i <- MyInterval("2012-01-01", "2012-12-31")
between(i, "2012-02-01") # Dispatches to between.MyInterval. Returns TRUE as expected.
between(150, 100, 200) # Dispatches to dplyr::between. Good, we didn't break anything?
Thank you
Any criticisms are welcomed. I know that between is a function that doesn't do type-dispatching out of the box, and so implementing this myself raises a code smell.
A possibility is to use data.table's inrange-function.
First, let's make an interval:
my.interval <- function(beg, end) data.table(beg = as.Date(beg), end = as.Date(end))
mi <- my.interval("2012-01-01", "2012-12-31")
Now you can do:
> as.Date("2012-02-01") %inrange% mi
[1] TRUE
Or define your own inrange-function:
my.inrange <- function(x, intv) data.table::inrange(as.Date(x), intv$beg, intv$end)
With that you can do:
> my.inrange("2012-02-01", mi)
[1] TRUE
As @Frank commented, you can make an infix variant of my.inrange too:
`%my.inrange%` <- my.inrange
now you can use it in the following notation as well:
"2012-02-01" %my.inrange% mi
Which is similar to the infix notation of data.table's between and inrange functions.
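For readers who would rather avoid the data.table dependency, the same idea can be sketched in base R with a small S3 class. The names date_range and %in_range% below are made up for illustration and do not come from any package; treat this as a minimal sketch under those assumptions rather than a finished design.

```r
# Hypothetical constructor: a named, closed date range stored as an S3 object.
date_range <- function(beg, end, name = NULL) {
  beg <- as.Date(beg)
  end <- as.Date(end)
  stopifnot(length(beg) == 1, length(end) == 1, beg <= end)
  if (is.null(name)) name <- as.character(beg)
  structure(list(beg = beg, end = end, name = name), class = "date_range")
}

# Hypothetical infix helper: TRUE where x falls inside the range (inclusive).
`%in_range%` <- function(x, range) {
  stopifnot(inherits(range, "date_range"))
  x <- as.Date(x)
  x >= range$beg & x <= range$end
}

yr2012 <- date_range("2012-01-01", "2012-12-31", name = "year-2012")
in_2012 <- c("2012-02-01", "2013-01-01") %in_range% yr2012
print(in_2012)
```

Because the range is a single object, a vector of ranges can be kept as a plain list of date_range objects and the start and end are reached by name (range$beg, range$end) rather than by position.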
I am using R to parse a list of strings in the form:
original_string <- "variable_name=variable_value"
First, I extract the variable name and value from the original string and convert the value to numeric class.
parameter_value <- as.numeric("variable_value")
parameter_name <- "variable_name"
Then, I would like to assign the value to a variable with the same name as the parameter_name string.
variable_name <- parameter_value
What is/are the function(s) for doing this?
assign is what you are looking for.
assign("x", 5)
x
[1] 5
but buyer beware.
See R FAQ 7.21
http://cran.r-project.org/doc/FAQ/R-FAQ.html#How-can-I-turn-a-string-into-a-variable_003f
You can use do.call:
do.call("<-",list(parameter_name, parameter_value))
There is another simple solution described here:
http://www.r-bloggers.com/converting-a-string-to-a-variable-name-on-the-fly-and-vice-versa-in-r/
To convert a string to a variable:
x <- 42
eval(parse(text = "x"))
[1] 42
And the opposite:
x <- 42
deparse(substitute(x))
[1] "x"
The function you are looking for is get():
assign ("abc",5)
get("abc")
Confirming that the memory address is identical:
getabc <- get("abc")
pryr::address(abc) == pryr::address(getabc)
# [1] TRUE
Reference: R FAQ 7.21 How can I turn a string into a variable?
Use x=as.name("string"). You can then use x to refer to the variable with name string.
I don't know, if it answers your question correctly.
Use strsplit to parse your input and, as Greg mentioned, assign to assign the variables.
original_string <- c("x=123", "y=456")
pairs <- strsplit(original_string, "=")
lapply(pairs, function(x) assign(x[1], as.numeric(x[2]), envir = globalenv()))
ls()
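If the values do not have to become separate variables in the global environment, the same parsing step can build a named vector instead, which is the kind of alternative R FAQ 7.21 points toward. A minimal sketch; params is a name introduced here for illustration:

```r
original_string <- c("x=123", "y=456")

# Split each "name=value" string into its two parts.
pairs <- strsplit(original_string, "=", fixed = TRUE)

# Build a named numeric vector: names from the left side, values from the right.
params <- setNames(
  as.numeric(vapply(pairs, `[`, character(1), 2)),
  vapply(pairs, `[`, character(1), 1)
)

params[["x"]]  # look a value up by its string name
```

Keeping the values in one object avoids cluttering the workspace and makes it easy to loop over names(params).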
assign is good, but I have not found a function for referring back to the variable you've created in an automated script. (as.name seems to work the opposite way). More experienced coders will doubtless have a better solution, but this solution works and is slightly humorous perhaps, in that it gets R to write code for itself to execute.
Say I have just assigned value 5 to x (var.name <- "x"; assign(var.name, 5)) and I want to change the value to 6. If I am writing a script and don't know in advance what the variable name (var.name) will be (which seems to be the point of the assign function), I can't simply put x <- 6 because var.name might have been "y". So I do:
var.name <- "x"
#some other code...
assign(var.name, 5)
#some more code...
#write a script file (1 line in this case) that works with whatever variable name
write(paste0(var.name, " <- 6"), "tmp.R")
#source that script file
source("tmp.R")
#remove the script file for tidiness
file.remove("tmp.R")
x will be changed to 6, and if the variable name was anything other than "x", that variable will similarly have been changed to 6.
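The temporary script is not strictly needed for this: assign() can overwrite an existing binding just as it creates a new one, and get() reads the value back by name. A minimal sketch of the same update, using a separate environment (vars, a name chosen here for illustration) so nothing leaks into the workspace:

```r
var.name <- "x"
vars <- new.env()

assign(var.name, 5, envir = vars)   # create the variable named by var.name
assign(var.name, 6, envir = vars)   # overwrite it without knowing the name in advance

updated <- get(var.name, envir = vars)   # read it back by its string name
print(updated)
```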
I was working with this a few days ago, and noticed that sometimes you will need to use the get() function to print the results of your variable.
i.e.:
varnames = c('jan', 'feb', 'march')
file_names = list.files('path to multiple csv files saved on drive', full.names = TRUE)
assign(varnames[1], read.csv(file_names[1])) # This will assign the variable
From there, if you try to print the variable varnames[1], it returns 'jan'.
To work around this, you need to do
print(get(varnames[1]))
If you want to convert string to variable inside body of function, but you want to have variable global:
test <- function() {
  do.call("<<-",list("vartest","xxx"))
}
test()
vartest
[1] "xxx"
Maybe I didn't understand your problem right, because of the simplicity of your example. To my understanding, you have a series of instructions stored in character vectors, and those instructions are very close to being properly formatted, except that you'd like to cast the right member to numeric.
If my understanding is right, I would like to propose a slightly different approach, that does not rely on splitting your original string, but directly evaluates your instruction (with a little improvement).
original_string <- "variable_name=\"10\"" # Your original instruction, but with an actual numeric on the right, stored as character.
library(magrittr) # Or library(tidyverse), but that seems like overkill if the point is just to import the pipe operator
eval(parse(text=paste(eval(original_string), "%>% as.numeric")))
print(variable_name)
#[1] 10
Basically, what we are doing is that we 'improve' your instruction variable_name="10" so that it becomes variable_name="10" %>% as.numeric, which is an equivalent of variable_name=as.numeric("10") with magrittr pipe-stream syntax. Then we evaluate this expression within current environment.
Hope that helps someone who'd wander around here 8 years later ;-)
Other than assign, one other way to assign value to string named object is to access .GlobalEnv directly.
# Equivalent
assign('abc',3)
.GlobalEnv$'abc' = 3
Accessing .GlobalEnv gives some flexibility, and my use case was assigning values to a string-named list. For example,
.GlobalEnv$'x' = list()
.GlobalEnv$'x'[[2]] = 5 # works
var = 'x'
.GlobalEnv[[glue::glue('{var}')]][[2]] = 5 # programmatic names from glue()
I am looping over an estimation and saving all the estimation objects and then pick the one with the lowest deviance. For this, I wanted to use the Filter/Map/Position functions, but I could not find a solution, because it always returns the first object of the list, instead of the second. I probably misunderstand something about how the Position function works, but would like to know what I missed.
MWE:
ls<-list(3,2,4)
Position(min,Map(function(x) {x^2}, ls))
I ended up using unlist and which.min
You ended up doing the correct procedure. Why? First, we'll pull up a salient extract from the help page:
Position(f, x, right = FALSE, nomatch = NA_integer_)
Find and Position are patterned after Common Lisp's find-if and position-if, respectively. If there is an element for which the predicate function gives true, then the first or last such element or its position is returned depending on whether right is false (default) or true, respectively. If there is no such element, the value specified by nomatch is returned. The current implementation is not optimized for performance.
So, Position() is going to apply f() to all elements of x and when the result of f() is TRUE (directly or via coercion) Position() will return the index of that element (if nothing ends up being TRUE then it returns the value assigned to nomatch).
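A small illustration of that behaviour, using a made-up vector and predicate (vals and is_big are not from the question):

```r
vals <- c(1, 5, 2, 7)
is_big <- function(v) v > 3   # a predicate: returns TRUE or FALSE

first_big <- Position(is_big, vals)                 # index of the first match
last_big  <- Position(is_big, vals, right = TRUE)   # index of the last match
no_match  <- Position(function(v) v > 100, vals)    # falls back to nomatch (NA)

# A non-predicate such as min() returns a number; any non-zero number
# is treated as TRUE, so the very first element "matches".
not_a_test <- Position(min, vals)
```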
Here's the actual Position() function source:
Position <- function (f, x, right=FALSE, nomatch=NA_integer_) {
  ind <- seq_along(x)
  if (right) ind <- rev(ind)
  for (i in ind) {
    if (f(x[[i]])) return(i)
  }
  nomatch
}
Following the source, you can probably see that min() is called on the first element of the list. It returns that element's minimum (here 9), which is non-zero and therefore coerced to TRUE, so Position() stops there and returns the index it was at: 1.
If you had been doing:
Position(min, Map(function(x) {x-3}, dat))
then you would have seen the result be:
## [1] 2
and possibly have thought it was working, but it's only returning that since the first element of the list is 3, 3-3 == 0, and 0 is coerced to FALSE (the second element gives -1, which is coerced to TRUE).
NOTE: ls is also the name of a base function. R can tell the two apart from usage context, but I don't like risking weird errors down the road by masking very common base names, so I used dat instead of ls (i.e. dat <- list(3, 2, 4)).
The idea behind Position() is more for something like:
eqls4 <- function(x) x==4
Position(eqls4, Map(function(x) {x^2}, dat))
which does return:
## [1] 2
So, what you ended up doing was 100% correct.
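For the original goal of picking the element with the smallest value, a base-R sketch of the unlist/which.min approach looks like this. Here dat stands in for the list of results; with fitted models one would compute a deviance per element instead of squaring.

```r
dat <- list(3, 2, 4)

# Compute one score per element, then flatten to a plain numeric vector.
scores <- unlist(Map(function(x) x^2, dat))

best_index <- which.min(scores)   # position of the smallest score
best_item  <- dat[[best_index]]   # the corresponding list element
```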
Note also that the purrr package provides alternative functional idioms that (IMO) tend to be more readable + have ways of ensuring proper types are maintained and it exports %>% so piping is also readily available:
library(purrr)
map(dat, ~.^2) %>%
  flatten_dbl() %>%
  which.min()
R has many means of coercion: as.logical, as.integer, as.data.frame, etc.
Using these functions, however, requires knowing in advance ("at compile time") the target class (logical, integer, data.frame, etc.).
But in the problem I'm working on, I need to coerce expr1 to class(expr2), where expr1 and expr2 are arbitrary R expressions whose classes are not known until runtime.
I don't expect that R has any built-in support for this exact use-case, but I hope that it does for the one of coercing an arbitrary expression to an arbitrary class. IOW, I hope that R already has something like coerce(expr, a.class)1.
I know that I can roll my own coerce, like this:
coerce <- function (expr, a.class) {
  ## coerce expr to a.class
  switch(a.class,
         character = as.character(expr),
         logical = as.logical(expr),
         integer = as.integer(expr),
         numeric = as.numeric(expr),
         # yadda-yadda-yadda
         stop(sprintf('unsupported class: %s', a.class)))
}
> coerce(5, "character")
[1] "5"
Also, I know that one can also resort to this:
coerce2 <- function (expr, a.class) {
  ## coerce expr to a.class; take 2
  rcode <- sprintf("as.%s(expr)", a.class)
  tryCatch(eval(parse(text=rcode)),
           error = function (condition) {
             stop(sprintf('unsupported class: %s', a.class))
           })
}
> coerce2(5, "data.frame")
  expr
1    5
...but I hope that R already supports something more robust and powerful than this.
1 My assumption is that I could then address the use-case I'm actually interested in with an expression of the form coerce(expr1, class(expr2)). I think this is a reasonable assumption, but in case it turns out not to hold, I thought I'd spell all this out.
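One existing facility that comes close is as() from the methods package, which takes the target class as a string, so something like as(expr1, class(expr2)) works whenever a coercion for that pair of classes is defined. The sketch below is illustrative only: coerce3 is a hypothetical helper that looks up the matching as.<class> function by name, and neither approach covers every class.

```r
library(methods)

target <- "character"
as_chr <- as(5, target)          # target class chosen at run time
as_num <- as("5", class(2.5))    # class(2.5) is "numeric"

# Hypothetical helper: find the as.<class> function by name and call it.
coerce3 <- function(expr, a.class) {
  fn <- tryCatch(
    match.fun(paste0("as.", a.class)),
    error = function(e) stop(sprintf("unsupported class: %s", a.class))
  )
  fn(expr)
}

as_lgl <- coerce3(1, "logical")
as_df  <- coerce3(5, "data.frame")
```

Note that class() can return more than one string (for example for POSIXct objects), in which case only a single class name should be passed on.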
I have a vector of dates and a vector of number-of-days.
dates <- seq(Sys.Date(), Sys.Date()+10, by='day')
number.of.days <- 1:4
I need to get a list with an entry for each number-of-days, where each entry in the list is the dates vector minus the corresponding number-of-days, ie.,
list(dates-1, dates-2, dates-3, dates-4)
The definition of - (function (e1, e2) .Primitive("-")) indicates that its first and second arguments are e1 and e2, respectively. So the following should work.
lapply(number.of.days, `-`, e1=dates)
But it raises an error.
Error in -.Date(X[[i]], ...) : can only subtract from "Date" objects
Furthermore, the following does work:
lapply(number.of.days, function(e1, e2) e1 - e2, e1=dates)
Is this a feature or a bug?
You can use:
lapply(number.of.days, `-.Date`, e1=dates)
Part of the problem is - is a primitive which doesn't do argument matching. Notice how these are the same:
> `-`(e1=5, e2=3)
[1] 2
> `-`(e2=5, e1=3)
[1] 2
From R Language Definition:
This subsection applies to closures but not to primitive functions. The latter typically ignore tags and do positional matching, but their help pages should be consulted for exceptions, which include log, round, signif, rep and seq.int.
So in your case, you end up using dates as the second argument to - even though you attempt to specify it as the first. By using the "Date" method for -, which is not a primitive, we can get it to work.
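The difference between closures and primitives can be seen directly. A minimal sketch, where sub_closure is a made-up wrapper used only for comparison and the dates are fixed so the result is reproducible:

```r
# A closure matches arguments by name, so swapping the tags swaps the operands.
sub_closure <- function(e1, e2) e1 - e2
by_name <- sub_closure(e2 = 5, e1 = 3)   # computes 3 - 5

# The primitive ignores the tags and matches by position.
by_position <- `-`(e2 = 5, e1 = 3)       # computes 5 - 3

# Wrapping the subtraction in a closure makes lapply() behave as intended.
dates <- seq(as.Date("2024-01-01"), by = "day", length.out = 3)
number.of.days <- 1:4
shifted <- lapply(number.of.days, function(n) dates - n)
```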
So technically, the behavior you are seeing is a feature, or perhaps a "documented inconsistency". The part that could possibly be considered a bug is that R will dispatch to the "Date" method for - even though that method does not support a non-date first argument:
> 1:4 - dates # dispatches to `-.Date` despite first argument not being date
Error in `-.Date`(1:4, dates) : can only subtract from "Date" objects
You might be better off using POSIXt dates. They're a bit more flexible, for example if you wanted to add a week or year. The equivalent answer to @BrodieG using lubridate functionality to work with POSIXt:
dates <- ymd(seq(Sys.Date(), Sys.Date()+10, by='day'))
number.of.days <- 1:4
list(dates-1, dates-2, dates-3, dates-4)
lapply(number.of.days, `-.POSIXt`, e1=dates)
Also, how's the fishing in Philly? :)
I have two lists of lists, humanSplit and ratSplit. humanSplit has elements of the form:
> humanSplit[1]
$Fetal_Brain_408_AGTCAA_L001_R1_report.txt
humanGene humanReplicate alignment RNAtype
66 DGKI Fetal_Brain_408_AGTCAA_L001_R1_report.txt 6 reg
68 ARFGEF2 Fetal_Brain_408_AGTCAA_L001_R1_report.txt 5 reg
If you type humanSplit[[1]], it gives the data without name $Fetal_Brain_408_AGTCAA_L001_R1_report.txt
ratSplit is essentially similar to humanSplit, differing only in column order. I want to apply Fisher's test to every possible pairing of replicates from humanSplit and ratSplit. I defined the following empty vectors, which I will use to store the results of my Fisher's tests:
humanReplicate <- vector(mode = 'character', length = 0)
ratReplicate <- vector(mode = 'character', length = 0)
pvalue <- vector(mode = 'numeric', length = 0)
For fisher's test between two replicates of humanSplit and ratSplit, I define the following function. In the function I use `geneList' which is a data.frame made by reading a file and has form:
> head(geneList)
human rat
1 5S_rRNA 5S_rRNA
2 5S_rRNA 5S_rRNA
Now here is the main function, where I use a function getGenetype which I already defined in other part of the code. Also x and y are integers :
fishertest <- function(x, y) {
  ratReplicateName <- names(ratSplit[x])
  humanReplicateName <- names(humanSplit[y])
  ## merging above two based on the one-to-one gene mapping as in geneList
  ## defined above.
  mergedHumanData <- merge(geneList, humanSplit[[y]], by.x = "human", by.y = "humanGene")
  mergedRatData <- merge(geneList, ratSplit[[x]], by.x = "rat", by.y = "ratGene")
  ## [here i do other manipulation with using already defined function
  ## getGenetype that is defined outside of this function and make things
  ## necessary to define following contingency table]
  contingencyTable <- matrix(c(HnRn, HnRy, HyRn, HyRy), nrow = 2)
  fisherTest <- fisher.test(contingencyTable)
  humanReplicate <- c(humanReplicate, humanReplicateName)
  ratReplicate <- c(ratReplicate, ratReplicateName)
  pvalue <- c(pvalue, fisherTest$p)
}
After doing all this, I make the grid eg to use in apply. Here I am basically trying to do something similar to a double for loop, applying the Fisher test to each pair:
eg <- expand.grid(i = 1:length(ratSplit),j = 1:length(humanSplit))
junk = apply(eg, 1, fishertest(eg$i,eg$j))
Now the problem is, when I try to run, it gives the following error when it tries to use function fishertest in apply
Error in humanSplit[[y]] : recursive indexing failed at level 3
Rstudio points out problem in following line:
mergedHumanData <-merge(geneList,humanSplit[[y]], by.x = "human", by.y = "humanGene")
Ultimately, I want to do the following:
result <- data.frame(humanReplicate,ratReplicate, pvalue ,alternative, Conf.int1, Conf.int2, oddratio)
I am struggling with these questions:
In defining fishertest function, how should I pass ratSplit and humanSplit and already defined function getGenetype?
And how I should use apply here?
Any help would be much appreciated.
Up front: read ?apply. Additionally, the first three hits on google when searching for "R apply tutorial" are helpful snippets: one, two, and three.
Errors in fishertest()
The error message itself has nothing to do with apply. The reason it got as far as it did is because the arguments you provided actually resolved. Try to do eg$i by itself, and you'll see that it is returning a vector: the corresponding column in the eg data.frame. You are passing this vector as an index in the i argument. The primary reason your function erred out is because double-bracket indexing ([[) only works with singles, not vectors of length greater than 1. This is a great example of where production/deployed functions would need type-checking to ensure that each argument is a numeric of length 1; often not required for quick code but would have caught this mistake. Had it not been for the [[ limit, your function may have returned incorrect results. (I've been bitten by that many times!)
BTW: your code is also incorrect in its scoped access to pvalue, et al. If you make your function return just the numbers you need and then aggregate them outside of the function, your life will simplify. (pvalue <- c(pvalue, ...) will find the pvalue assigned outside the function but will not update it as you want; the assignment only creates a local copy.) You are defeating one purpose of writing this into a function. When thinking about writing this function, try to answer only this question: "how do I compare a single rat record with a single human record?" Only after that works correctly and simply, without having to overwrite variables in the parent environment, should you try to answer the question "how do I apply this function to all pairs and aggregate it?" Try very hard to have your function not change anything outside of its own environment.
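To illustrate that advice, here is a sketch of a function that checks its index arguments and returns one row instead of appending to outer vectors. compare_pair is a hypothetical stand-in for fishertest, and the contingency table is made-up data; only fisher.test() is taken from the question.

```r
pvalue <- numeric(0)

# Assignment inside a function creates a local copy; the outer pvalue is untouched.
append_inside <- function() {
  pvalue <- c(pvalue, 0.5)
  invisible(NULL)
}
append_inside()
length(pvalue)  # still 0

# Hypothetical replacement: validate scalar indices, return a one-row data.frame.
compare_pair <- function(x, y, tables) {
  stopifnot(is.numeric(x), length(x) == 1, is.numeric(y), length(y) == 1)
  ft <- fisher.test(tables[[x]][[y]])
  data.frame(rat = x, human = y, pvalue = ft$p.value)
}

# Made-up 2x2 contingency table for a single rat/human pair.
tables <- list(list(matrix(c(3, 1, 1, 3), nrow = 2)))
row1 <- compare_pair(1, 1, tables)
```

With the length check in place, passing a whole column such as eg$i fails immediately with a clear message instead of surfacing later as an indexing error.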
Errors in apply()
Had your function worked properly despite these errors, you would have received the following error from apply:
apply(eg, 1, fishertest(eg$i, eg$j))
## Error in match.fun(FUN) :
## 'fishertest(eg$i, eg$j)' is not a function, character or symbol
When you call apply this way, R evaluates the third argument before apply ever gets to use it. Since it is simply a call to fishertest(eg$i, eg$j), which is intended to return a data.frame row (inferred from your previous question), it resolves to such, and apply then sees something akin to:
apply(eg, 1, data.frame(...))
Now you can see that apply is being handed a data.frame and not a function.
The third argument (FUN) needs to be a function itself that takes as its first argument a vector containing the elements of the row (1) or column (2) of the matrix/data.frame. As an example, consider the following contrived example:
eg <- data.frame(aa = 1:5, bb = 11:15)
apply(eg, 1, mean)
## [1] 6 7 8 9 10
# similar to your use, will not work; this error comes from mean not getting
# any arguments, your error above is because the call is evaluated first
# and its result is not a function
apply(eg, 1, mean())
## Error in mean.default() : argument "x" is missing, with no default
Realize that mean is a function itself, not the return value from a function (there is more to it, but this definition works). Because we're iterating over the rows of eg (because of the 1), the first iteration takes the first row and calls mean(c(1, 11)), which returns 6. The equivalent of your code here is mean()(c(1, 11)), which will fail for a couple of reasons: (1) mean requires an argument and is not getting one, and (2) regardless, it does not return a function itself (returning functions belongs to the "functional programming" paradigm, which is easy in R but uncommon for most programmers).
In the example here, mean will accept a single argument which is typically a vector of numerics. In your case, your function fishertest requires two arguments (templated by my previous answer to your question), which does not work. You have two options here:
1. Change your fishertest function to accept a single vector as an argument and parse the index numbers from it. Both of the following versions do this:
fishertest <- function(v) {
  x <- v[1]
  y <- v[2]
  ratReplicateName <- names(ratSplit[x])
  ## ...
}
or
fishertest <- function(x, y) {
  if (missing(y)) {
    y <- x[2]
    x <- x[1]
  }
  ratReplicateName <- names(ratSplit[x])
  ## ...
}
The second version allows you to continue using the manual form of fishertest(1, 57) while also allowing you to do apply(eg, 1, fishertest) verbatim. Very readable, IMHO. (Better error checking and reporting can be used here, I'm just providing a MWE.)
2. Write an anonymous function to take the vector and split it up appropriately. This anonymous function could look something like function(ii) fishertest(ii[1], ii[2]). This is typically how it is done for functions that either do not transform as easily as in #1 above, or for functions you cannot or do not want to modify. You can either assign this intermediary function to a variable (which makes it no longer anonymous, go figure) and pass that intermediary to apply, or just pass it directly to apply, a la:
.func <- function(ii) fishertest(ii[1], ii[2])
apply(eg, 1, .func)
## equivalently
apply(eg, 1, function(ii) fishertest(ii[1], ii[2]))
There are two reasons why many people opt to name the function: (1) if the function is used multiple times, better to define once and reuse; (2) it makes the apply line easier to read than if it contained a complex multi-line function definition.
As a side note, there are some gotchas with using apply and family that will be confusing if you don't understand them. Not the least of these is that when your function returns vectors, the matrix returned from apply will need to be transposed (with t()), after which you'll still need to rbind or otherwise aggregate.
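A small sketch of that gotcha and one way around it, using a made-up function pair_summary that returns a named vector for each row of the grid:

```r
eg <- expand.grid(i = 1:2, j = 1:3)

# Hypothetical per-pair function returning a named numeric vector.
pair_summary <- function(v) {
  c(i = v[["i"]], j = v[["j"]], total = v[["i"]] + v[["j"]])
}

# apply() puts each result in a column, so the matrix comes back "sideways".
raw <- apply(eg, 1, pair_summary)
dim(raw)   # 3 rows (values) by 6 columns (pairs)

# Transpose to get one row per pair, then convert for further use.
result <- as.data.frame(t(raw))
```

When each call returns a one-row data.frame rather than a vector, collecting the pieces in a list and combining them with do.call(rbind, ...) is a common alternative.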
This is one area where using ddply may provide a more readable solution. There are several tutorials showing it off. For a quick intro, read this; for a more in depth discussion on the bigger picture in which ddply plays a part, read Hadley's Split, Apply, Combine Strategy for Data Analysis paper from JSS.