I am automating the creation of a series of plots each of which is based on a class of chemicals (e.g., metals, PCBs, etc.); for reasons I'll leave out, I am plotting the legend outside of the plot and using negative values for the inset argument for the legend() function to do this (e.g., inset = c(-0.2, 0)). As each of the chemical classes requires different values for the inset I thought of creating a hash table using the hash package to store the values needed for each chemical class. However, in order to store these in the hash table I was storing the vector of values as a string (e.g., "c(-0.2, 0)").
My code for the hash table looks like this:
legend.hash <- hash(chem.class, c('c(-0.2, 0)', 'c(-0.2, 0)', 'c(-0.25, -0.4)', 'c(-0.25, -0.3)', 'c(-0.2, 0)', 'c(-0.4, -0.2)', 'c(-0.2, 0)', 'c(-0.2, 0)'))
where chem.class is a vector of chemical classes.
The values retrieved from the resulting hash table are, of course, strings such as "c(-0.2, 0)". Is there a way of converting such a string so that R evaluates it as code, so that it could be used like the following: legend(..., inset = legend.hash[[chem.class[i]]])?
Or is there a better way to implement this using the traditional graphics system?
The classic way of evaluating a string as if it were R code is to use eval() and parse():
> eval(parse(text="c(-0.2,0)"))
[1] -0.2 0.0
But I really wonder why you insist on using a hash instead of a simple list.
legend.hash <- list(c(-0.2, 0), c(-0.2, 0), c(-0.25, -0.4), c(-0.25, -0.3),
c(-0.2, 0), c(-0.4, -0.2), c(-0.2, 0), c(-0.2, 0))
names(legend.hash) <- chem.class
would allow you to use the exact construct you're using now, without all the tricky bits and pieces of eval() and parse(), especially thinking about the infamous fortune(106) :
> require(fortunes)
> fortune(106)
If the answer is parse() you should usually rethink the question.
-- Thomas Lumley
R-help (February 2005)
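A minimal sketch of the list-based lookup in use, assuming three made-up class names (the question does not list the real ones) and with the actual legend() call left as a comment:

```r
# Hypothetical class names; the question does not show the real ones.
chem.class <- c("metals", "PCBs", "PAHs")

# One numeric inset vector per class, stored directly (no strings needed).
legend.inset <- list(c(-0.2, 0), c(-0.25, -0.4), c(-0.4, -0.2))
names(legend.inset) <- chem.class

for (i in seq_along(chem.class)) {
  inset_i <- legend.inset[[chem.class[i]]]  # numeric vector of length 2
  # legend("topright", legend = ..., inset = inset_i, xpd = TRUE) would go here
  print(inset_i)
}
```

Because each element is already a numeric vector, nothing needs to be parsed or evaluated at lookup time.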
It may work better to position your legend using the grconvertX and grconvertY functions rather than using negative insets.
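As a rough sketch of that idea, the following places a legend just to the right of the plot region by converting normalised plot coordinates into user coordinates. The margin sizes, the 2% offset and the legend entries are assumptions chosen only for illustration:

```r
pdf(NULL)                                   # throwaway device, nothing is written to disk
op <- par(mar = c(5, 4, 4, 8), xpd = TRUE)  # wide right margin; allow drawing outside the plot region
plot(1:5, 1:5, type = "n")

# "2% to the right of the plot region, level with its top edge",
# converted from normalised plot coordinates (npc) to user coordinates.
legend_x <- grconvertX(1.02, from = "npc", to = "user")
legend_y <- grconvertY(1.00, from = "npc", to = "user")

legend(legend_x, legend_y, legend = c("Cu", "Pb", "Zn"),
       fill = rainbow(3))                   # hypothetical legend entries

par(op)
invisible(dev.off())
```

The position is expressed relative to the plot region, so the same two numbers should work for every chemical class regardless of the data range.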
If you really want to convert a string with 2 number values in it into a vector of numbers then consider using the strapply function from the gsubfn package. This way you avoid the parse function and all the potential headaches that come with it. It may also end up being faster.
If you change the strings to just the numbers and a separator (without the 'c' and parens) then you could just use as.numeric on the result of strsplit, which may be even faster.
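A small sketch of the strsplit route, assuming ";" as the separator and two made-up class names; parse_inset is a hypothetical helper, not part of any package:

```r
# Store only the numbers and a separator; ";" is an arbitrary choice.
inset_strings <- c(metals = "-0.2;0", PCBs = "-0.25;-0.4")  # hypothetical class names

# Split one string on the separator and convert the pieces to numbers.
parse_inset <- function(s) as.numeric(strsplit(s, ";", fixed = TRUE)[[1]])

parse_inset(inset_strings[["PCBs"]])
```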
I want to read a bunch of factor data and create a transition matrix from it that I can visualise nicely. I found a very sweet package, called 'heemod' which, together with 'diagram' does a decent job.
For my first quick-and-dirty approach, I ran a piece of Python code to get to the matrix, then used this R snippet to draw the graph. Note that the transition probabilities come from that undisclosed and less important Python code, but you can also just assume that I calculated them on paper.
library('heemod')
library('diagram')
mat_dim <- define_transition(
state_names = c('State_A', 'State_B', 'State_C'),
.18, .73, .09,
.22, .0, .78,
.58, .08, .33);
plot(mat_dim)
However, I would like to integrate all in R and generate the transition matrix and the graph within R and from the sequence data directly.
This is what I have so far:
library(markovchain)
library('heemod')
library('diagram')
# the data --- this is normally read from a file
data = c(1,2,1,1,1,2,3,1,3,1,2,3,1,2,1,2,3,3,3,1,2,3,2,3,1,2,3,3,1,2,3,3,1)
fdata = factor(data)
rdata = factor(data,labels=c("State_A","State_B","State_C"))
# create transition matrix
dimMatrix = createSequenceMatrix(rdata, toRowProbs = TRUE)
dimMatrix
QUESTION: how can I transfer dimMatrix so that define_transition can process it?
mat_dim <- define_transition( ??? );
plot(mat_dim)
Any ideas? Are there better/easier solutions?
The input to define_transition seems to be quite awkward. Perhaps this is due to my inexperience with the heemod package but it seems the only way to input transitions is element by element.
Here is a workaround
library(heemod)
library(diagram)
First, convert the transition matrix to a list (the rounding is optional). The list elements correspond to the ... arguments of define_transition:
lis <- as.list(round(dimMatrix, 3))
Now add any other named arguments you wish to the list:
lis$state_names = colnames(dimMatrix)
Finally, pass these arguments to define_transition using do.call:
plot(do.call(define_transition, lis))
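Update, in answer to a question in the comments: as.list() walks the matrix column by column, so transpose it first if define_transition should receive the cells row by row, as in the layout used in the question: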
lis <- as.list(t(round(dimMatrix, 3)))
lis$state_names = colnames(dimMatrix)
plot(do.call(define_transition, lis))
The reasoning behind do.call
The most obvious way (which does not work here) is to do:
define_transition(dimMatrix, state_names = colnames(dimMatrix))
However, this throws an error, since define_transition expects each transition to be supplied as a separate argument and not as a matrix or a list. In order to avoid typing:
define_transition(0.182, 0.222,..., state_names = colnames(dimMatrix))
one can put all the arguments in a list and then call do.call on that list as I have done.
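The pattern can be tried without heemod. In this sketch, show_args is a hypothetical stand-in for any function that, like define_transition, wants each cell as a separate unnamed argument plus a named one. The matrix values are made up.

```r
# Hypothetical stand-in: takes the cells one by one, row by row, plus a named argument.
show_args <- function(..., state_names) {
  cells <- c(...)
  matrix(cells, nrow = length(state_names), byrow = TRUE,
         dimnames = list(state_names, state_names))
}

m <- matrix(c(0.1, 0.9,
              0.6, 0.4), nrow = 2, byrow = TRUE,
            dimnames = list(c("A", "B"), c("A", "B")))

arg_list <- as.list(t(m))            # t() so the cells come out row by row
arg_list$state_names <- colnames(m)  # named argument goes into the same list

# Equivalent to show_args(0.1, 0.9, 0.6, 0.4, state_names = c("A", "B"))
rebuilt <- do.call(show_args, arg_list)
```

Unnamed list elements are matched to ... in order and named elements to the arguments of the same name, which is why a single list can carry both.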
I am currently trying to make my code dryer by rewriting some parts with the help of functions. One of the functions I am using is:
datasetperuniversity <- function(university, year) {
  assign(paste("data", university, sep = ""),
         subset(get(paste("originaldata", year, sep = "")),
                get(paste("allcollaboration", university, sep = "")) == 1))
}
Executing the function datasetperuniversity("Harvard","2000") would result within the function in something like this:
dataHarvard=subset(originaldata2000,allcollaborationHarvard==1)
The function runs nearly perfectly, except that it does not store the results in dataHarvard. I read that this is normal in functions, and that using <<- instead of = could solve this issue. However, since I am making use of the assign function this is not really possible, since the = is just the outcome of the assign function.
Here is some data:
sales = c(2, 3, 5,6)
numberofemployees = c(1, 9, 20,12)
allcollaborationHarvard = c(0, 1, 0,1)
originaldata = data.frame(sales, numberofemployees, allcollaborationHarvard)
Generally, it's best not to embed data/a variable into the name of an object. So instead of using assign to dataHarvard, make a list data with an element called "Harvard":
# enumerate unis, attaching names for lapply to use
unis = setNames(, "Harvard")
# make a table for each subset with lapply
data = lapply(unis, function(x)
originaldata[originaldata[[ paste0("allcollaboration", x) ]] == 1, ]
)
which gives
> data
$Harvard
  sales numberofemployees allcollaborationHarvard
2     3                 9                       1
4     6                12                       1
As seen here, you can use DF[["column name"]] to access a column instead of get as in the OP. Also, see the note in ?subset:
Warning
This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.
Generally, it's also better not to embed data in column names if possible. If the allcollaboration* columns are mutually exclusive, they can be collapsed to a single categorical variable with values like "Harvard", "Yale", etc. Alternately, it might make sense to put the data in long form.
For more guidance on arranging data, I recommend Hadley Wickham's tidy data paper.
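As an illustration of collapsing the indicator columns into one categorical variable, here is a sketch that assumes the columns are mutually exclusive. The Yale column is hypothetical and added only so there is more than one university to collapse:

```r
# Toy data; allcollaborationYale is a made-up column for illustration.
originaldata <- data.frame(
  sales                   = c(2, 3, 5, 6),
  numberofemployees       = c(1, 9, 20, 12),
  allcollaborationHarvard = c(0, 1, 0, 1),
  allcollaborationYale    = c(1, 0, 0, 0)
)

collab_cols <- grep("^allcollaboration", names(originaldata), value = TRUE)
unis <- sub("^allcollaboration", "", collab_cols)

# Index of the column holding the 1 in each row (NA when no column does).
hit <- apply(originaldata[collab_cols] == 1, 1, function(r) which(r)[1])
originaldata$university <- unis[hit]

# One table per university, stored in a named list.
data_by_uni <- split(originaldata[c("sales", "numberofemployees")],
                     originaldata$university)
```

With a single university column, data_by_uni$Harvard plays the role of dataHarvard, and rows that belong to no university are simply left out by split().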
I always desire to have my R code as flexible as possible; at present I have three (potentially more) curves to compare based on a parameter delta, but I don't want to hardcode the values of delta anywhere (or even how many values if I can avoid it).
I am trying to make a legend that involves both Greek and a variable substitution for the delta values, so each legend entry is of the form like 'delta = 0.01', where delta is Greek and 0.01 is determined by variable. Many different combinations of paste, substitute, bquote and expression have been tried, but always end up with some verbatim code leftover in the finished legend, OR fail to put 'delta' into symbolic form.
delta <- c(0.01,0.05,0.1)
plot(type="n", x=1:5, y=1:5) #the curves themselves are irrelevant
legend_text <- vector(length=length(delta)) #I don't think lists work either
for(i in 1:length(delta)){
legend_text[i] <- substitute(paste(delta,"=",D),list(D=delta[i]) )
}
legend(x="topleft", fill=rainbow(length(delta)), legend=legend_text)
Since legend=substitute(paste(delta,"=",D),list(D=delta[1])) works for a single entry, I've also tried a 'semi-hardcoded' version, fixing the length of delta:
legend(x="topleft", fill=rainbow(length(delta)),
legend=c(substitute(paste(delta,"=",A), list(A=delta[1])),
substitute(paste(delta,"=",B), list(B=delta[2])),
substitute(paste(delta,"=",C), list(C=delta[3])) )
)
but this has the same issues as before.
Is there a way I can do this, or do I need to change the code by hand with each update of delta?
Try using lapply() with as.expression() to generate your legend labels, and use bquote() to create the individual expressions:
legend_text <- as.expression(lapply(delta, function(d) {
bquote(delta==.(d))
} ))
Note that with plotmath you need == to get an equals sign. Also no need for paste() since nothing is really a string here.
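Put together with the plotting code from the question, the whole thing might look like the sketch below. The pdf(NULL) device is only there so the example runs without opening a window or writing a file.

```r
delta <- c(0.01, 0.05, 0.1)

# One plotmath expression per value: .(d) is replaced by the number,
# while the bare name delta is drawn as the Greek letter.
legend_text <- as.expression(lapply(delta, function(d) bquote(delta == .(d))))

pdf(NULL)  # throwaway device
plot(1:5, 1:5, type = "n")
legend("topleft", fill = rainbow(length(delta)), legend = legend_text)
invisible(dev.off())
```

Nothing here depends on how many values delta holds, so changing delta is the only edit needed when the set of curves changes.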
I'm making an xtableFtable in R Sweave and can't find a way to suppress the digits with this code. What am I doing wrong? I've read that it can happen if your values aren't numeric but factor or character, but is prop.table making them non-numeric? I'm lost...
library(xtable)
a <- ftable(prop.table(table(mtcars$mpg, mtcars$hp), margin=2)*100)
b <- xtableFtable(a, method = "compact", digits = 0)
print.xtableFtable(b, rotate.colnames = TRUE)
I've already tried with digits=c(0,0,0,0...) too.
You could use options(digits) to control how many digits will print. Try something like options(digits = 4) as the first line of your code (change 4 to whatever value you want between 1 and 22). See ?options for more information.
Or round the values before printing
a = round(ftable(prop.table(table(mtcars$mpg, mtcars$hp), margin=2)*100), 2)
b = xtableFtable(a, method = "compact")
print.xtableFtable(b, rotate.colnames = TRUE)
The "digits" argument to xtableFtable seems to be unimplemented (as of my version, which is 1.8.3), since after playing around with it for half an hour nothing seems to make any difference.
There's a hint to this effect in the function documentation:
It is not recommended that users change the values of align, digits or align. First of all, alternative values have not been tested. Secondly, it is most likely that to determine appropriate values for these arguments, users will have to investigate the code for xtableFtable and/or print.xtableFtable.
It's probably just carried over from the xtable function (on which xtableFtable is surely based) as a TODO which the maintainer hasn't gotten around to yet.
This is a question for anyone familiar with the 'stringdist' package.
I am trying to write a function that does the following:
Searches a very long list of characters such as this (only 16 of ~1 million shown):
> stripList
[1] "AAAAAAAAAAAAAAAAAAAAAAAAAAAADAABAAADCDDAD" "BAAAABBBDACDBABAAADDCBDADBCCBDCDDCDBCDDBA"
[3] "BDDABDCCAAABABBAACADCBDADBCCBDCDDCDBCDDBA" "AADBBACDDDBABDCABAADBCADCBDDDCCC"
[5] "BBCDBBDCCBABDBCABDBBDBDDDADCDDADDDCDDCDDD" "BDDCDACABDCCBACBADCDCBDADBCCBDCDDCDDCDDBA"
[7] "BCDBADCBBDDBBBBDCBDADBCCBDCDDCDBCDDDDAAAA" "DABDDCDACABDCCBACBADC"
[9] "CABABDDCCCCACDCCDCCDADCAAAAAAAAACADADDADA" "BAABCBBBDBCDCDDADDDDCDDADBCCBDCDD"
[11] "BBDDDACDCABDDDBBACDCBDADBCCDDCDDCDDCDDBDD" "BDDABDCCAAABABBBACADCBDADBCCBDCDDCDBCDDBA"
[13] "BDDBBBBDDBDABBACDBDCBDADBCCBDCDD" "BDDABDCCAAABABBBACADCBDADBCCBDCDDCDBCDDBA"
[15] "DABDDCDACABDCCBACBADC" "BBADBACDDBABAACABCABCDCBDADBCCBDCDDCDDDDD"
for instances of each sequence in a list of query sequences that is structured like this.
Ex:
SeqName1 # queryNames
BBCDBBDCCBABDBCA # querySeqs
SeqName2 # queryNames
BBBDCCDCCCCDDDCAAACD # querySeqs
I want to see how many times (if any) the query sequence shows up in any of my 'stripList' and allow for 1 insertion, 1 deletion, 1 substitution, and 1 transposition, and get an output like this:
>dt
queryNames  TimesFound
SeqName1             5
SeqName2           145
To do so I am using the 'amatch' function of the 'stringdist' package in the following manner:
dt<-rapply(as.list(querySeqs), function(x) amatch(x, stripList, method = "osa", useBytes = TRUE, weight = c(d = 0.5, i = 0.5, s = 0.9, t = 0.9), maxDist=0.9))
dt<-data.frame(dt)
colnames(dt) <- "TimesFound"
dt<-cbind(queryNames,dt)
I have a few questions:
In the 'amatch' function, when using method = "osa", how is the "weight" argument interpreted? As an example, if I were to use:
method = "osa", weight = c(d = 0.5, i = 0.5, s = 0.9, t = 0.9), maxDist=0.9
am I saying that I want 90% matching of my "querySeqs"? Meaning, do those fractions pertain to "querySeqs" or my table (stripList)?
What function does "maxDist" have? (Is it also interpreted as a percent?)
Is there a way to maximize the runtime efficiency of my code above (by perhaps using data.table, etc)? I only ask because my actual datasets are ~2000 sequence queries being searched through ~1,000,000 sequence lists.
Is there a better way than 'amatch' to look for whole sequences (not just substrings of them like 'agrep' does)?
I apologize if these are elementary questions; the documentation on this is vague to me and, frankly, I'm still learning.
Thanks in advance.
This question seems to have been up here for a while, and I've only just found it. In short:
(1) The weights are penalties for each action. They allow you to tell amatch that, e.g., a deletion or insertion is ok, but you think a transposition should be punished more. Judging by your question, you can probably leave the weights as they are.
(2) maxDist tells amatch that two strings that are more than maxDist apart will never be considered a match. The default is zero so only exact matches are allowed. It is not a percentage. Relevant values of maxDist depend on which distance function is used. I think that you could use method='osa', maxDist=1 (allowing for a single transposition, insertion, deletion or substitution, but no combinations), or possibly maxDist=4 if you're willing to allow combinations of up to four edits. For the edit-like distances, the distance is bounded by the number of characters in the longest string. Please see the R Journal paper for the ranges of all supported distances: http://journal.r-project.org/archive/2014-1/loo.pdf
(3) I am optimizing the code all the time. Version 0.9 will use multithreading. I see you are using rapply, you could probably avoid this by just using
amatch(querySeqs,stripList,method='osa',maxDist=4)
(4) At the moment, I think that amatch is the best implementation for R (but as I'm the author, I may be biased :)).
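To make point (2) concrete, here is a rough sketch of counting how many entries lie within a fixed number of edits of each query. It deliberately uses base R's adist() rather than stringdist so that it runs without extra packages. adist() computes plain Levenshtein distance, so unlike 'osa' a transposition counts as two edits. The toy sequences are made up. Note also that amatch(), like match(), returns a position per query rather than a count, so a TimesFound column needs a distance computation along these lines.

```r
# Toy data; the real stripList and querySeqs are far larger.
stripList  <- c("ABCD", "ABCE", "ABD", "DDDDD", "CCCC", "ABDC")
querySeqs  <- c("ABCD", "DDDD")
queryNames <- c("SeqName1", "SeqName2")

# Rows are queries, columns are entries of stripList.
d <- adist(querySeqs, stripList)

# maxDist-style cutoff: an absolute number of edits, not a percentage.
max_edits <- 1
dt <- data.frame(queryNames, TimesFound = rowSums(d <= max_edits))
dt
```

For roughly 2000 queries against 1,000,000 sequences the full distance matrix would be too large to hold at once, so in practice one would probably loop over the queries and keep only the counts.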