I keep losing my session* in the console when trying to perform readLines (from base) with map (from purrr).
*I don't get a new prompt, and R doesn't seem to be running anything.
If I input a vector of file paths:
paths <- c("a/file.csv", "a/nother_file.csv")
And try and get all top lines out with map and readLines, R dies.
result <- map(paths, readLines(n = 1))
But if I do:
result <- map(1:2, function(x) readLines(paths[x], n = 1))
It works.
What am I doing wrong?
The solution has already been posted. Here’s a brief explanation of what happens in your case:
To use purrr::map, you are supposed to pass it a function. But readLines(n = 1) isn’t a function, it’s a function call expression. This is very different: to give another example, sum is a function, sum(1 : 10) is a function call expression, which evaluates to the integer value 55. But sum, on its own, evaluates to … itself: a function, which can be called (and you can’t call sum(1 : 10): it’s just an integer).
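A quick way to see the difference at the console (a minimal sketch; the names f and v are only for illustration):

```r
f <- sum          # the function itself: nothing is computed yet
v <- sum(1:10)    # a call: evaluated immediately, v holds the result

is.function(f)    # TRUE
is.function(v)    # FALSE, v is just the integer 55
f(1:10)           # f can still be called later, e.g. by map() or lapply()
```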
When you write readLines(n = 1), that function is invoked immediately when map is called: not by purrr on the data, but rather just as it stands. The same happens if you invoke readLines(n = 1) directly, without wrapping it in map(…).
But this isn’t killing the R session. Instead, it’s telling readLines to read from the file that is specified as its default. Looking at the documentation of the function, we see:
readLines(con = stdin(), n = -1L, ok = TRUE, warn = TRUE,
encoding = "unknown", skipNul = FALSE)
con = stdin() — by default, readLines is reading from standard input. In an interactive terminal, this blocks until the standard input (that is, the interactive terminal) sends an “end of file” instruction. On most command lines, you can simulate this by pressing the key combination Ctrl+D. Inside RStudio, the behaviour may be different.
This will work:
result <- map(paths, readLines, n = 1)
From ?purrr::map:
Usage
map(.x, .f, ...)
... Additional arguments passed on to .f.
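To see this argument forwarding in action without touching real files, here is a small sketch. It is an assumption-laden stand-in: it uses base R's lapply, which forwards extra arguments the same way, and two temporary files made up for the example. With purrr loaded, map(paths, readLines, n = 1) should behave the same.

```r
# Create two small temporary files so the example is self-contained.
paths <- c(tempfile(fileext = ".csv"), tempfile(fileext = ".csv"))
writeLines(c("a,b", "1,2"), paths[1])
writeLines(c("x,y", "3,4"), paths[2])

# Pass the function itself; n = 1 is forwarded to every readLines() call.
first_lines <- lapply(paths, readLines, n = 1)

# Equivalent form with an explicit anonymous function.
first_lines2 <- lapply(paths, function(p) readLines(p, n = 1))

unlink(paths)  # clean up the temporary files
```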
I'd like to wrap around the checkmate library's qassert to check multiple variable's specification at a time. Importantly, I'd like assertion errors to still report the variable name that's out of spec.
So I created checkargs to loop through the input arguments. But to get each variable passed on to qassert, I use the same code in every iteration, and that ambiguous code string gets used for the error message instead of the problematic variable name.
qassert() (via vname()) determines what to display in the assertion error with something like deparse(eval.parent(substitute(substitute(x)))). Is there any way to box up get(var) such that R will see e.g. 'x' on deparse instead?
At least one workaround is eval(parse()). But something like checkargs(x="n', system('echo malicious',intern=T),'") has me hoping for an alternative.
checkargs <- function(...) {
args<-list(...)
for(var in names(args))
checkmate::qassert(get(var,envir=parent.frame()),args[[var]])
# scary string interpolation alternative
#eval(parse(text=paste0("qassert(",var,",'",args[[var]], "')")),parent.frame())
}
test_checkargs <- function(x, y) {checkargs(x='b',y='n'); print(y)}
# checkargs is working!
test_checkargs(T, 1)
# [1] 1
# but the error message isn't helpful.
test_checkargs(1, 1)
# Error in checkargs(x = "b", y = "n") :
# Assertion on 'get(var, envir = parent.frame())' failed. Must be of class 'logical', not 'double'.
#
# want:
# Assertion on 'x' failed. ...
substitute() with as.name seems to do the trick. This still uses eval but without string interpolation.
eval(substitute(
qassert(x,spec),
list(x=as.name(var),
spec=args[[var]])),
envir=parent.frame())
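The same trick can be tried without checkmate installed. In the sketch below, assert_logical is a hypothetical stand-in for qassert (it only reports the expression it was handed), and check_all and demo are made-up names; the point is just that building the call with as.name() makes the reported expression the variable's own name.

```r
# Hypothetical stand-in for checkmate::qassert(): reports the expression it received.
assert_logical <- function(x) {
  label <- deparse(substitute(x))
  if (!is.logical(x)) stop("Assertion on '", label, "' failed.", call. = FALSE)
  invisible(x)
}

check_all <- function(vars) {
  for (var in vars) {
    # Build the call assert_logical(<var>) with a real symbol, not a string...
    cl <- substitute(assert_logical(x), list(x = as.name(var)))
    # ...and evaluate it in the frame of whoever called check_all().
    eval(cl, envir = parent.frame())
  }
}

demo <- function(flag) check_all("flag")
msg <- tryCatch(demo(1), error = conditionMessage)
msg  # the message names 'flag', not the code used to look it up
```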
I am trying to extract data from a website using a custom function:
library(tidyverse)
library(rvest)
url = "https://www.boerse.de/fundamental-analyse/garbage/" # last part does not change outcome, therefore 'garbage'
read_html_tables = function(ISIN){
content <- read_html(paste0(url,ISIN,"#guv")) %>%
html_table(dec = ",") %>%
.[c(5:10)]
return(content)
}
If I run this function with a given ISIN, e.g. US88579Y1010, I get the desired result: a list containing 6 tibbles with the data I want. But if I wrap this function into lapply() with a vector containing a few hundred ISINs, I get the following error:
list_of_all <- lapply(X = df[,2], FUN = read_html_tables)
Error: x must be a string of length 1
Called from: read_xml.character(x, encoding = encoding, ..., as_html = TRUE,
options = options)
If I call which(length(df[,2]) != 1) (the column where the ISINs are), I get integer(0), so there seems to be no issue with the ISIN column in this dataframe. And since it works with a single ISIN as input, the read_html(paste0(url,ISIN)) part seems to work as well.
I have used a very similar function before and wrapped it into lapply(). The earlier function did basically exactly what this function does, but had to do some searching and combining for the correct URL to pass into the read_html(paste0(url,ISIN)) part (on another website).
I am a bit puzzled, since this error did not occur before. But now that it has occurred, if I try to run the earlier function, I get the same error (which I never received before).
Maybe there is a more talented R programmer out there who can spot the issue?
Edit: Since a reply suggested the ISIN-list is the issue:
The first two are US88579Y1010 and US8318652091. Passed individually into the function as well as passing it in a vector (c(ISIN1, ISIN2)) and passing the vector to lapply works. But if I point at both ISINs inside the tibble (df[1:2,2]) I get the error from above. What am I missing here?
Solution:
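read_xml.character from read_html() does not seem to accept a column from a tibble as valid input. Indexing a tibble with df[,2] returns a one-column tibble rather than a vector, so lapply() hands the whole column to the function at once. Converting the tibble to a data.frame and rerunning gives the desired output.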
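The difference can be reproduced in base R alone. The sketch below uses a plain data.frame with drop = FALSE to imitate how a tibble keeps df[, 2] as a one-column table; the column names and the use of length() as the applied function are assumptions made for the illustration.

```r
df <- data.frame(id = 1:2,
                 isin = c("US88579Y1010", "US8318652091"),
                 stringsAsFactors = FALSE)

# Tibble-like behaviour: the result is still a table with one column.
as_table <- df[, 2, drop = FALSE]
# lapply() iterates over columns, so FUN sees the whole column at once.
lengths_table <- lapply(as_table, length)

# Pulling the column out as a plain vector: FUN is called once per ISIN.
as_vector <- df[["isin"]]
lengths_vector <- lapply(as_vector, length)
```

With a real tibble, df[[2]] or dplyr::pull(df, 2) would give the plain vector without converting the whole table.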
I'm trying to pass two file paths as parameters to a function. But it's not accepting the inputs. Here's what I'm doing:
partition<-function(d1,p2){
d1<-read.table(file = d1, fill = TRUE)
p2<-read.table(file = p2, fill = TRUE)
}
and while calling the function:
partition("samcopy.txt","partcopy.txt")
The .txt files are not being read into the variables inside the function. How can I make the variables hold the tables?
AidanGawronSki's approach works, but from a programming standpoint should be avoided! Here is a more traditional answer to your problem.
partition<-function(d1,p2){
a <- read.table(file = d1, fill = TRUE)
b <- read.table(file = p2, fill = TRUE)
res <- list(a,b)
names(res) <- c(d1,p2)
res
}
To understand why the above approach is "better", it is important to understand what environments are and, more generally, the R scoping rules. Environments are essentially your workspace. For example, when you first open R and begin assigning objects, these objects are stored within the Global Environment. Another example of an environment arises when you call a function: the function creates its own environment containing any parameters you have passed to it. By doing this, R ensures that when you call a function, it has no "side effects"; said another way, it does not affect the global environment.
Let me show you an example. Imagine you begin an R session, and assign d1 <- 1 in your Global Environment. You're going to want to use d1 later on in your analysis and it would be a shame if it changed without you knowing it, right?
If you utilize AidanGawronSki's approach when you call
partition<-function(d1,p2){
d1 <<- read.table(file = d1, fill = TRUE)
}
The d1 in your Global Environment will change to be the result of read.table(file = d1, fill = TRUE). This is very, very dangerous! An object you previously assigned to be one thing is now another thing, and you are not even warned of this change.
The same problem, however, will never occur with the approach I have proposed. I strongly recommend you get in the habit of using this approach! If you don't, any function can change things in your Global Environment without you knowing.
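A tiny experiment makes the difference visible. The function names local_version and global_version are made up for this sketch:

```r
d1 <- 1

# Ordinary assignment: only the function's own environment is touched.
local_version <- function() {
  d1 <- "changed locally"
  d1
}

# Super assignment: reaches outside the function and overwrites d1 there.
global_version <- function() {
  d1 <<- "changed globally"
  invisible(NULL)
}

local_version()
after_local <- d1    # still 1

global_version()
after_global <- d1   # silently replaced
```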
For more info read this, this or just google something like "functions with no side effects"
FYI there are also several other problems with your code. First, you need to tell your function what to return. All you did is call a function, assign stuff in the local environment and then close the function. A function returns the value of the last expression it evaluates (if that expression is an assignment, the value is returned invisibly, so nothing is printed). This is why, in my example, I put res as the last line of the function. Also, you are not assigning your objects correctly. You pass a string such as d1 <- "text.txt" to your function and then ask it to do the following: "text.txt" <- read.table("text.txt",...). That simply does not make sense. You need to assign the output from read.table to an object. In my example, I assign them to a and b.
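One detail about return values is easiest to check in code: when the last expression is an assignment, the value still comes back, but invisibly, which is why the call appears to return nothing. A minimal sketch with made-up function names:

```r
f_assign <- function() {
  x <- 42        # last expression is an assignment
}

f_value <- function() {
  x <- 42
  x              # last expression is the value itself
}

r1 <- f_assign()                          # the value can still be captured
vis_assign <- withVisible(f_assign())$visible
vis_value  <- withVisible(f_value())$visible
```

Calling f_assign() on its own prints nothing, while f_value() prints the result; ending the function with the object you want returned avoids the confusion.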
Use the super assignment operator <<-:
partition<-function(d1,p2){
d1 <<- read.table(file = d1, fill = TRUE)
p2 <<- read.table(file = p2, fill = TRUE)
}
I have two lists of lists, humanSplit and ratSplit. humanSplit has elements of the form:
> humanSplit[1]
$Fetal_Brain_408_AGTCAA_L001_R1_report.txt
   humanGene                            humanReplicate alignment RNAtype
66      DGKI Fetal_Brain_408_AGTCAA_L001_R1_report.txt         6     reg
68   ARFGEF2 Fetal_Brain_408_AGTCAA_L001_R1_report.txt         5     reg
If you type humanSplit[[1]], it gives the data without name $Fetal_Brain_408_AGTCAA_L001_R1_report.txt
ratSplit is essentially similar to humanSplit, differing only in column order. I want to apply Fisher's test to every possible pairing of replicates from humanSplit and ratSplit. I defined the following empty vectors, which I will use to store the results of my Fisher's tests:
humanReplicate <- vector(mode = 'character', length = 0)
ratReplicate <- vector(mode = 'character', length = 0)
pvalue <- vector(mode = 'numeric', length = 0)
For Fisher's test between two replicates of humanSplit and ratSplit, I define the following function. In the function I use geneList, which is a data.frame made by reading a file and has the form:
> head(geneList)
    human     rat
1 5S_rRNA 5S_rRNA
2 5S_rRNA 5S_rRNA
Now here is the main function, where I use a function getGenetype that I already defined in another part of the code. Also, x and y are integers:
fishertest <-function(x,y) {
ratReplicateName <- names(ratSplit[x])
humanReplicateName <- names(humanSplit[y])
## merging above two based on the one-to-one gene mapping as in geneList
## defined above.
mergedHumanData <-merge(geneList,humanSplit[[y]], by.x = "human", by.y = "humanGene")
mergedRatData <- merge(geneList, ratSplit[[x]], by.x = "rat", by.y = "ratGene")
## [here i do other manipulation with using already defined function
## getGenetype that is defined outside of this function and make things
## necessary to define following contingency table]
contingencyTable <- matrix(c(HnRn,HnRy,HyRn,HyRy), nrow = 2)
fisherTest <- fisher.test(contingencyTable)
humanReplicate <- c(humanReplicate,humanReplicateName )
ratReplicate <- c(ratReplicate,ratReplicateName )
pvalue <- c(pvalue , fisherTest$p)
}
After doing all this I make the matrix eg to use in apply. Here I am basically trying to do something similar to a double for loop and then run Fisher's test on each pair:
eg <- expand.grid(i = 1:length(ratSplit),j = 1:length(humanSplit))
junk = apply(eg, 1, fishertest(eg$i,eg$j))
Now the problem is, when I try to run, it gives the following error when it tries to use function fishertest in apply
Error in humanSplit[[y]] : recursive indexing failed at level 3
Rstudio points out problem in following line:
mergedHumanData <-merge(geneList,humanSplit[[y]], by.x = "human", by.y = "humanGene")
Ultimately, I want to do the following:
result <- data.frame(humanReplicate,ratReplicate, pvalue ,alternative, Conf.int1, Conf.int2, oddratio)
I am struggling with these questions:
In defining fishertest function, how should I pass ratSplit and humanSplit and already defined function getGenetype?
And how I should use apply here?
Any help would be much appreciated.
Up front: read ?apply. Additionally, the first three hits on google when searching for "R apply tutorial" are helpful snippets: one, two, and three.
Errors in fishertest()
The error message itself has nothing to do with apply. The reason it got as far as it did is because the arguments you provided actually resolved. Try to do eg$i by itself, and you'll see that it is returning a vector: the corresponding column in the eg data.frame. You are passing this whole vector as the index argument x. The primary reason your function erred out is because double-bracket indexing ([[) only works with single indices, not vectors of length greater than 1 (a longer vector is treated as a recursive index into nested lists, hence the message). This is a great example of where production/deployed functions would need type-checking to ensure that each argument is a numeric of length 1; often not required for quick code, but it would have caught this mistake. Had it not been for the [[ limit, your function may have returned incorrect results. (I've been bitten by that many times!)
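The indexing behaviour is easy to check on a toy list. Everything below is made up for illustration, including the pick helper, which sketches the kind of length-1 check mentioned above:

```r
lst <- list(10, 20, 30)

sub <- lst[1:3]     # single bracket: a vector of indices is fine
one <- lst[[2]]     # double bracket: one index, returns the element itself

# A longer index in [[ ]] is applied recursively: nested[[1]][[2]].
nested <- list(list("a", "b"))
deep <- nested[[c(1, 2)]]

# On a flat list the recursion has nowhere to go, so this is an error.
failed <- inherits(try(lst[[1:3]], silent = TRUE), "try-error")

# A guarded accessor that refuses anything but a single numeric index.
pick <- function(lst, i) {
  stopifnot(is.numeric(i), length(i) == 1)
  lst[[i]]
}
guarded <- inherits(try(pick(lst, 1:3), silent = TRUE), "try-error")
```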
BTW: your code is also incorrect in its scoped access to pvalue, et al. If you make your function return just the numbers you need and then aggregate them outside of the function, your life will simplify. (pvalue <- c(pvalue, ...) will find pvalue assigned outside the function but will not update it as you want.) You are defeating one purpose of writing this into a function. When thinking about writing this function, try to answer only this question: "how do I compare a single rat record with a single human record?" Only after that works correctly and simply, without having to overwrite variables in the parent environment, should you try to answer the question "how do I apply this function to all pairs and aggregate it?" Try very hard to have your function not change anything outside of its own environment.
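As a rough sketch of that two-step approach: compare_pair below is a hypothetical stand-in for the real comparison (it returns a made-up score instead of running fisher.test), and the aggregation happens entirely outside of it.

```r
# Step 1: a function that handles exactly one (rat, human) pair and
# returns one row, without touching anything outside itself.
compare_pair <- function(x, y) {
  data.frame(rat = x, human = y, score = x * 10 + y)  # placeholder result
}

# Step 2: apply it to every pair and aggregate afterwards.
eg <- expand.grid(i = 1:2, j = 1:3)
rows <- Map(compare_pair, eg$i, eg$j)   # one data.frame row per pair
result <- do.call(rbind, rows)          # stack the rows into one table
```

In the real function, the row would carry the replicate names and the p-value from fisher.test instead of the placeholder score.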
Errors in apply()
Had your function worked properly despite these errors, you would have received the following error from apply:
apply(eg, 1, fishertest(eg$i, eg$j))
## Error in match.fun(FUN) :
## 'fishertest(eg$i, eg$j)' is not a function, character or symbol
When you call apply in this sense, it is parsing the third argument and, in this example, evaluates it. Since it is simply a call to fishertest(eg$i, eg$j), which is intended to return a data.frame row (inferred from your previous question), it resolves to such, and apply then sees something akin to:
apply(eg, 1, data.frame(...))
Now you can see that apply is being handed a data.frame and not a function.
The third argument (FUN) needs to be a function itself that takes as its first argument a vector containing the elements of the row (1) or column (2) of the matrix/data.frame. As an example, consider the following contrived example:
eg <- data.frame(aa = 1:5, bb = 11:15)
apply(eg, 1, mean)
## [1] 6 7 8 9 10
# similar to your use, will not work; this error comes from mean not getting
# any arguments, your error above is because
apply(eg, 1, mean())
## Error in mean.default() : argument "x" is missing, with no default
Realize that mean is a function itself, not the return value from a function (there is more to it, but this definition works). Because we're iterating over the rows of eg (because of the 1), the first iteration takes the first row and calls mean(c(1, 11)), which returns 6. The equivalent of your code here is mean()(c(1, 11)), which will fail for a couple of reasons: (1) mean requires an argument and is not getting one, and (2) regardless, it does not return a function itself (returning functions is part of the "functional programming" paradigm, easy in R but uncommon for most programmers).
In the example here, mean will accept a single argument which is typically a vector of numerics. In your case, your function fishertest requires two arguments (templated by my previous answer to your question), which does not work. You have two options here:
1. Change your fishertest function to accept a single vector as an argument and parse the index numbers from it. Both of the following versions do this:
fishertest <- function(v) {
x <- v[1]
y <- v[2]
ratReplicateName <- names(ratSplit[x])
## ...
}
or
fishertest <- function(x, y) {
if (missing(y)) {
y <- x[2]
x <- x[1]
}
ratReplicateName <- names(ratSplit[x])
## ...
}
The second version allows you to continue using the manual form of fishertest(1, 57) while also allowing you to do apply(eg, 1, fishertest) verbatim. Very readable, IMHO. (Better error checking and reporting can be used here, I'm just providing a MWE.)
2. Write an anonymous function to take the vector and split it up appropriately. This anonymous function could look something like function(ii) fishertest(ii[1], ii[2]). This is typically how it is done for functions that either do not transform as easily as in #1 above, or for functions you cannot or do not want to modify. You can either assign this intermediary function to a variable (which makes it no longer anonymous, figure that) and pass that intermediary to apply, or just pass it directly to apply, ala:
.func <- function(ii) fishertest(ii[1], ii[2])
apply(eg, 1, .func)
## equivalently
apply(eg, 1, function(ii) fishertest(ii[1], ii[2]))
There are two reasons why many people opt to name the function: (1) if the function is used multiple times, better to define once and reuse; (2) it makes the apply line easier to read than if it contained a complex multi-line function definition.
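Both options can be tried end to end on a toy function. pair_sum is a made-up stand-in for fishertest that just adds its two indices, so the sketch only illustrates how the arguments arrive:

```r
# Option 1 style: accept either (x, y) or a single vector c(x, y).
pair_sum <- function(x, y) {
  if (missing(y)) {
    y <- x[2]
    x <- x[1]
  }
  x + y
}

eg <- expand.grid(i = 1:2, j = 1:2)

direct  <- pair_sum(1, 57)                                    # manual form still works
by_rows <- apply(eg, 1, pair_sum)                             # each row arrives as one vector
wrapped <- apply(eg, 1, function(ii) pair_sum(ii[1], ii[2]))  # option 2: anonymous wrapper
```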
As a side note, there are some gotchas with using apply and family that, if you don't understand them, will be confusing. Not the least of which is that when your function returns vectors, the matrix returned from apply will need to be transposed (with t()), after which you'll still need to rbind or otherwise aggregate.
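The transposition gotcha shows up even on a tiny, contrived data.frame: each row's result becomes a column of the output.

```r
eg <- data.frame(aa = 1:3, bb = 11:13)

# The function returns a length-2 vector for each of the 3 rows...
out <- apply(eg, 1, function(row) c(lo = min(row), hi = max(row)))

dim(out)        # ...but the result is 2 x 3: one COLUMN per input row
fixed <- t(out) # transpose to get one row per input row again
dim(fixed)
```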
This is one area where using ddply may provide a more readable solution. There are several tutorials showing it off. For a quick intro, read this; for a more in depth discussion on the bigger picture in which ddply plays a part, read Hadley's Split, Apply, Combine Strategy for Data Analysis paper from JSS.
I'm attempting to use the bnlearn package to calculate conditional probabilities, and I'm running into a problem when the "cpquery" function is used within a loop. I've created an example, shown below, using data included with the package. When using the cpquery function in a loop, a variable created in the loop ("evi" in the example) is not recognized by the function. I receive the error:
Error in parse(text = evi) : object 'evi' not found
The creation steps of "evi" are based on examples provided by the author.
Any help you could provide would be great. I'm desperate to find a way that I can apply the cpquery function for a large number of observations.
library(bnlearn)
data(learning.test)
fitted = bn.fit(hc(learning.test), learning.test)
bn.function <- function(network, evidence_data) {
a <- NULL
b <- nrow(evidence_data)
for (i in 1:b) {
evi <- paste("(", names(evidence_data), "=='",
sapply(evidence_data[i,], as.character), "')",
sep = "", collapse = " & ")
a[i] <- cpquery(network, (C=='c'), eval(parse(text=evi)))
}
return(a)
}
test <- bn.function(fitted, learning.test)
Thanks in advance!
I don't know if this is due to a bugfix or just because I tried another approach - anyways, looping works if you iteratively build up the evidence list outside of the cpquery-function.
An example for an iteration through a list called evidenceData with all-positive evidences:
for(i in names(evidenceData)){
loopEvidenceList <- list()
loopEvidenceList[[i]] <- "TRUE"
a =cpquery(fitted = bayesNet, event = queryNode == "TRUE",
evidence = loopEvidenceList, method = "lw", n = 100000)
print(a)
}
Depending on the way your evidence is available, you might need more sophisticated preparation of the "loopEvidenceList", but once you have that prepared, it works fine.
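If the evidence sits in the rows of a data.frame, one way to prepare such a named list is sketched below. The data and the row_to_evidence helper are made up for illustration, and the cpquery call itself is left out so the snippet runs without bnlearn; the resulting list is what would be passed as evidence = ... together with method = "lw", as in the loop above.

```r
# Made-up evidence table: one observation per row, one node per column.
evidence_data <- data.frame(A = factor(c("a", "b")),
                            B = factor(c("c", "a")))

# Hypothetical helper: turn row i into a named list of character values.
row_to_evidence <- function(df, i) {
  lapply(df[i, , drop = FALSE], as.character)
}

ev <- row_to_evidence(evidence_data, 1)
str(ev)  # a list with one named entry per node
```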
To avoid the scoping problem, you can postpone the call to eval and do it inside the cpquery function. If you directly pass evi (the character variable) to cpquery and then parse it inside the definition, the chain of environments gets shifted and cpquery will have access to evi.
You can use m.cpquery <- edit(cpquery) to fork your own version of the function and insert the following line at its beginning:
evidence = parse(text = evidence)
and then save your new function.
So the heading of m.cpquery will look like:
> m.cpquery
function (fitted, event, evidence, cluster = NULL, method = "ls",
..., debug = FALSE)
{
evidence = parse(text = evidence)
check.fit(fitted)
check.logical(debug)
...
Now you can use m.cpquery in your own function like before, except we'll pass the plain character variable to it:
a[i] <- m.cpquery(network, (C=='c'), evi)
Note that in the first line of m.cpquery, we only parsed the evidence character variable and didn't call eval on it. cpquery is a front-end to conditional.probability.query (see here) and we're relying on conditional.probability.query's subsequent call to eval.
I should say that this is a rather ugly workaround. And it only works if you are using logic sampling (method='ls'). But if you want to use likelihood weighting, the check.mutilated.evidence function will raise an error. I haven't checked if injecting an eval expression before it gets called would result in a mayhem of subsequent errors leading to hell.
I feel like the problem is that you are using the same variable in the evidence as well as in the event. learning.test contains the values of the "C" variable, and then we are trying to predict C as the event. Maybe using a subset of the original dataset that excludes C will do the trick.