The automerge guidelines were not met -- Create Julia Package

Hi everyone,
May I ask why this happens after I create a new Julia package and register it at https://juliahub.com/ui/Home?
I think I have already checked every step very carefully, but I still cannot figure out where the bug comes from.
Can anyone tell me how to solve this problem? Many thanks!

Here is the Pull Request that Eric mentioned which tells you exactly what went wrong:
https://github.com/JuliaRegistries/General/pull/48736
Quoting from there:
Name does not meet all of the following: starts with an upper-case letter, ASCII alphanumerics only, not all letters are upper-case.
Name is not at least 5 characters long.
Repo URL does not end with /name.jl.git, where name is the package name.
Package name similar to 2 existing packages.
Similar to ALFA. Damerau-Levenshtein distance 2 is at or below cutoff of 2.
Similar to BBI. Damerau-Levenshtein distance 2 is at or below cutoff of 2.
Basically, ABBA is not a very descriptive package name, and the general registry favours unambiguous and self-explanatory names.
Now these are only auto-merge rules, which means that they prevent automatic addition of your package to the general registry, but not the addition per se. If you believe that there are good reasons why ABBA is the perfect name for your package and it should be registered under this name, just comment on the PR and it might get manually merged.

Related

How to make an R object immutable?

I'm working in R, and I'd like to define some variables that I (or one of my collaborators) cannot change. In C++ I'd do this:
const std::string path( "/projects/current" );
How do I do this in the R programming language?
Edit for clarity: I know that I can define strings like this in R:
path = "/projects/current"
What I really want is a language construct that guarantees that nobody can ever change the value associated with the variable named "path."
Edit to respond to comments:
It's technically true that const is a compile-time guarantee, but it would be valid in my mind for the R interpreter to stop execution with an error message. For example, look what happens when you try to assign a value to a numeric constant:
> 7 = 3
Error in 7 = 3 : invalid (do_set) left-hand side to assignment
So what I really want is a language feature that allows you to assign values once and only once, and there should be some kind of error when you try to assign a new value to a variable declared as const. I don't care if the error occurs at run-time, especially if there's no compilation phase. This might not technically be const by the Wikipedia definition, but it's very close. It also looks like this is not possible in the R programming language.
See lockBinding:
a <- 1
lockBinding("a", globalenv())
a <- 2
Error: cannot change value of locked binding for 'a'
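A small helper along these lines (hypothetical, not part of base R) can assign a value and lock it in one step:
const <- function(name, value, env = globalenv()) {
  # assign the value, then lock the binding so it cannot be reassigned
  assign(name, value, envir = env)
  lockBinding(name, env)
  invisible(value)
}
const("path", "/projects/current")
path <- "/other"
Error: cannot change value of locked binding for 'path'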
Since you are planning to distribute your code to others, you could (should?) consider creating a package. Within that package, create a NAMESPACE; there you can define variables that will have a constant value, at least from the point of view of the functions your package uses. Have a look at Tierney (2003), Name Space Management for R.
I'm pretty sure that this isn't possible in R. If you're worried about accidentally re-writing the value then the easiest thing to do would be to put all of your constants into a list structure then you know when you're using those values. Something like:
my.consts <- list(pi = 3.14159, e = 2.718, c = 3e8)
Then when you need to access them you have an aide-mémoire to remind you what not to do, and it also pushes them out of your normal namespace.
Another place to ask would be the R development mailing list. Hope this helps.
(Edited for new idea:) The bindenv functions provide an
experimental interface for adjustments to environments and bindings within environments. They allow for locking environments as well as individual bindings, and for linking a variable to a function.
This seems like the sort of thing that could give a false sense of security (like a const pointer to a non-const variable) but it might help.
(Edited for focus:) const is a compile-time guarantee, not a lock-down on bits in memory. Since R doesn't have a compile phase where it looks at all the code at once (it is built for interactive use), there's no way to check that future instructions won't violate any guarantee. If there's a right way to do this, the folks at the R-help list will know. My suggested workaround: fake your own compilation. Write a script to preprocess your R code that will manually substitute the corresponding literal for each appearance of your "constant" variables.
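A minimal sketch of that fake-compilation idea in R (the file names and constants here are hypothetical):
# substitute the literal for each appearance of a "constant" variable
consts <- c(PATH = '"/projects/current"', E = "2.718")
src <- readLines("analysis.R")
for (nm in names(consts)) {
  src <- gsub(paste0("\\b", nm, "\\b"), consts[[nm]], src)
}
writeLines(src, "analysis_compiled.R")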
(Original:) What benefit are you hoping to get from having a variable that acts like a C "const"?
Since R has exclusively call-by-value semantics (unless you do some munging with environments), there isn't any reason to worry about clobbering your variables by calling functions on them. Adopting some sort of naming conventions or using some OOP structure is probably the right solution if you're worried about you and your collaborators accidentally using variables with the same names.
The feature you're looking for may exist, but I doubt it, given the origin of R as an interactive environment where you'd want to be able to undo your actions.
R doesn't have a language constant feature. The list idea above is good; I personally use a naming convention like ALL_CAPS.
I took the answer below from this website
The simplest sort of R expression is just a constant value, typically a numeric value (a number) or a character value (a piece of text). For example, if we need to specify a number of seconds corresponding to 10 minutes, we specify a number.
> 600
[1] 600
If we need to specify the name of a file that we want to read data from, we specify the name as a character value. Character values must be surrounded by either double-quotes or single-quotes.
> "http://www.census.gov/ipc/www/popclockworld.html"
[1] "http://www.census.gov/ipc/www/popclockworld.html"

Finding spaces in a many-time pad

I'm currently completing an online course in cryptography and have been given an exercise to complete. The course has been running for a while and I know the answer is on the web, but I would like to complete it myself through my own work and research.
I have a list of 13 ciphertexts based on a one/many-time pad: the same key has been used to encrypt each plaintext. My task is to decrypt the last ciphertext.
The steps I have taken so far are based on the crib-dragging techniques described at the following locations:
http://adamsblog.aperturelabs.com/2013/05/back-to-skule-one-pad-two-pad-me-pad.html
https://crypto.stackexchange.com/questions/6020/many-time-pad-attack
and I'm using the following tool to XOR the ciphertexts.
In the tutorial I'm following, the author suggests that the first step is to identify spaces. I have tried to follow the steps but still cannot identify the spaces once I XOR the ciphertexts.
When I XOR the first ciphertext (cipher 1) with ciphers 2 and 3, I get the following:
15040900180952180C4549114F190E0159490A49120E00521201064C0A4F144F281B13490341004F480203161600000046071B1E4119061D1A411848090F4E0D0000161A0A41140C16160600151C00170B0B090653014E410D4C530F01150116000307520F104D200345020615041C541A49151744490F4C0D0015061C0A1F454F1F4509074D2F01544C080A090028061C1D002E413D004E0B141118
000D064819001E0303490A124C5615001647160C1515451A041D544D0B1D124C3F4F0252021707440D0B4C1100001E075400491E4F1F0A5211070A490B080B0A0700190D044E034F110A00001300490F054F0E08100357001E0853D4315FCEACFA7112C3E55D74AAF3394BB08F7504A8E5019C4E3E838E0F364946F31721A49AD2D24FF6775EFCB4F79FE4217A01B43CB5068BF3B52CA76543187274
000000003E010609164E0C07001F16520D4801490B09160645071950011D0341281B5253040F094C0D4F08010545050150050C1D544D061C5415044548090717074F0611454F164F1F101F411A4F430E0F0219071A0B411505034E461C1B0310454F12480D55040F18451E1B1F0A1C541646410D054C0D4C1B410F1B1B03149AD2D24FF6775EFCB4F79FE4217A01B43CB5068BF3B52CA76543187274
I'm getting confused as to where the spaces are, based on the ASCII table, which gives 20 (hex) as the value of a space.
Sorry if this is not enough information; I can provide more if required. Thanks, Mark.
The question you're linking to has the correct answer.
You don't appear to use that method. You should have c1⊕c2 and c1⊕c3 but your question contains three blocks of strings, one with 6 lines, one with 5 lines and one with 4 lines. That's insufficient for us to even guess what your problem is. Go back to the linked answer, read it, and follow the steps listed there.
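For what it's worth, here is a minimal sketch (in R, on hypothetical short ciphertexts, not the real ones above) of the usual space-spotting heuristic: a space (0x20) XORed with a letter yields that letter with its case flipped, so a byte in the letter range of c1⊕c2 suggests a space in one of the two plaintexts.
# split a hex string into integer bytes
hex_to_bytes <- function(h) {
  strtoi(substring(h, seq(1, nchar(h), 2), seq(2, nchar(h), 2)), 16L)
}
c1 <- hex_to_bytes("15040900180952")  # hypothetical ciphertext fragments
c2 <- hex_to_bytes("000D064819001E")
x <- bitwXor(c1, c2)
which(x >= 0x41 & x <= 0x7A)  # positions where one plaintext likely has a space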

Genbank query (package seqinr): searching in sequence description

I am using the function query() from the package seqinr to download myoglobin DNA sequences from GenBank. E.g.:
query("myoglobins","K=myoglobin AND SP=Turdus merula")
Unfortunately, for a lot of the species I'm looking for I don't get any sequence at all (or only a very short one), even though I find sequences when I search manually on the website. This is because the query searches for "myoglobin" in the keywords only, while often there isn't any entry there. Often the protein type is specified only in the name ("definition" on GenBank), but I have no idea how to search for that.
The help page for query() doesn't seem to offer any option for this in the details, a "generic search" without any "K=" doesn't work, and I haven't found anything via googling.
I'd be happy about any links, explanations and help. Thank you! :)
There is a complete manual for the seqinr package which describes the query language in more depth in chapter 5 (available at http://seqinr.r-forge.r-project.org/seqinr_2_0-1.pdf). I was trying to do a similar query, and the description for many of the genes/CDS is blank, so they don't come up when searching with the K= option. One alternative would be to search for the organism alone, then match gene names in the individual annotations and pull out the accession numbers, which you could then use to re-query the database for your sequences.
This would pull out the annotation for the first gene:
choosebank("emblTP")                         # open a database
query("ACexample", "sp=Turdus merula")       # all sequences for the species
getName(ACexample$req[[1]])                  # name of the first sequence
annotations <- getAnnot(ACexample$req[[1]])  # its full annotation
cat(annotations, sep = "\n")
I think this would be a pretty time-consuming way to tackle the problem, but there doesn't seem to be an efficient way of searching the annotations directly. I'd be interested in any solutions you might come up with.
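A rough sketch of that workaround, assuming the ACexample query from above has been run (the "myoglobin" pattern is just the example term):
# scan each annotation for "myoglobin" and collect matching sequence names
hits <- character(0)
for (i in seq_along(ACexample$req)) {
  ann <- getAnnot(ACexample$req[[i]])
  if (any(grepl("myoglobin", ann, ignore.case = TRUE))) {
    hits <- c(hits, getName(ACexample$req[[i]]))
  }
}
hits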

What are fortunes?

In R, one sometimes sees people making references to fortunes. For example:
fortune(108)
What does this mean? Where does this originate? Where can I get the code?
Edit. The sharp-eyed reader would have noticed that this question marks the 5,000th question with the [r] tag. Forgive the frivolity, but such a milestone should be marked with a bit of humour. For an extra bit of fun, you can provide an answer with your favourite fortune cookie.
It refers to the fortunes package, which is a package that contains a whole set of humorous quotes and comments from the help lists, conferences, fora and even StackOverflow.
It is actually a database, a small data frame you can browse through.
library(fortunes)
fortune()
to get a random one. Or look for a specific one, e.g.:
> fortune("stackoverflow")
datayoda: Bing is my friend...I found the cumsum() function.
Dirk Eddelbuettel: If bing is your friend, then rseek.org is bound
to be your uncle.
-- datayoda and Dirk Eddelbuettel (after searching for a function that
computes cumulative sums)
stackoverflow.com (October 2010)
If you want to get all of them in a dataframe, just do
MyFortunes <- read.fortunes()
The numbers sometimes referred to are the row numbers of this data frame. To find everything on Stack Overflow:
> grep("(?i)stackoverflow",MyFortunes$source)
[1] 273 275
> fortune(275)
I used a heuristic... pulled from my posterior. That makes it Bayesian, right?
-- JD Long (in a not too serious chat about modeling strategies)
Stackoverflow (November 2010)
And for the record, 108 is this one:
R> library(fortunes)
R> fortune(108)
Actually, I see it as part of my job to inflict R on people who are
perfectly happy to have never heard of it. Happiness doesn't equal
proficient and efficient. In some cases the proficiency of a person
serves a greater good than their momentary happiness.
-- Patrick Burns
R-help (April 2005)
R>
They're humorous (sometimes snarky) comments collected from the R lists.
install.packages("fortunes")
Or more generally
install.packages("sos")
library("sos")
findFn("fortune")
A quick search on CRAN turns up the fortunes package, which basically just prints random witty quotes related to R. The concept is based on the fortune program from Unix.

Fuzzy matching of product names

I need to automatically match product names (cameras, laptops, TVs, etc.) that come from different sources to a canonical name in the database.
For example "Canon PowerShot a20IS", "NEW powershot A20 IS from Canon" and "Digital Camera Canon PS A20IS"
should all match "Canon PowerShot A20 IS". I've worked with Levenshtein distance with some added heuristics (removing obvious common words, assigning a higher cost to number changes, etc.), which works to some extent, but unfortunately not well enough.
The main problem is that even single-letter changes in relevant keywords can make a huge difference, but it's not easy to detect which are the relevant keywords. Consider for example three product names:
Lenovo T400
Lenovo R400
New Lenovo T-400, Core 2 Duo
The first two are ridiculously similar strings by any standard (OK, Soundex might help to distinguish the T and R in this case, but the names might as well be 400T and 400R); the first and the third are quite far from each other as strings, but are the same product.
Obviously, the matching algorithm cannot be 100% precise; my goal is to automatically match around 80% of the names with high confidence.
Any ideas or references are much appreciated.
I think this will boil down to distinguishing key words such as Lenovo from chaff such as New.
I would run some analysis over the database of names to identify key words. You could use code similar to that used to generate a word cloud.
Then I would hand-edit the list to remove anything obviously chaff, like maybe New is actually common but not key.
Then you will have a list of key words that can be used to help identify similarities. You would associate the "raw" name with its keywords, and use those keywords when comparing two or more raw names for similarities (literally, percentage of shared keywords).
Not a perfect solution by any stretch, but I don't think you are expecting one?
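A minimal sketch of this keyword idea in R, on toy data (the frequency threshold is an assumption you would tune, and the list still needs hand-editing):
# count word frequencies across the name database to propose key words
names_db <- c("Canon PowerShot A20 IS", "NEW powershot A20 IS from Canon",
              "Digital Camera Canon PS A20IS")
words <- tolower(unlist(strsplit(names_db, "\\s+")))
freq <- sort(table(words), decreasing = TRUE)
keywords <- names(freq)[freq >= 2]  # hand-edit afterwards to drop chaff
# score two raw names by the fraction of key words they share
shared_keywords <- function(a, b, kw) {
  ka <- intersect(tolower(strsplit(a, "\\s+")[[1]]), kw)
  kb <- intersect(tolower(strsplit(b, "\\s+")[[1]]), kw)
  length(intersect(ka, kb)) / max(1, length(union(ka, kb)))
}
shared_keywords(names_db[1], names_db[2], keywords)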
The key understanding here is that you do have a proper distance metric. That is in fact not your problem at all. Your problem is in classification.
Let me give you an example. Say you have 20 entries for the Foo X1 and 20 for the Foo Y1. You can safely assume they are two groups. On the other hand, if you have 39 entries for the Bar X1 and 1 for the Bar Y1, you should treat them as a single group.
Now, the distance X1 <-> Y1 is the same in both examples, so why is there a difference in the classification? That is because Bar Y1 is an outlier, whereas Foo Y1 isn't.
The funny part is that you do not actually need to do a whole lot of work to determine these groups up front. You simply do a recursive classification: you start out with one node per group, and then add a supernode for the two closest nodes. In the supernode, store the best assumption, the size of its subtree and the variation within it. As many of your strings will be identical, you'll soon get large subtrees with identical entries. Recursion ends with a single supernode at the root of the tree.
Now map the canonical names against this tree. You'll quickly see that each will match an entire subtree. Now, use the distances between these trees to pick the distance cutoff for that entry. If you have both Foo X1 and Foo Y1 products in the database, the cut-off distance will need to be lower to reflect that.
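As a rough illustration of this bottom-up grouping, a sketch in base R (the cutoff height of 2 is arbitrary):
# agglomerative clustering over pairwise edit distances
entries <- c("Foo X1", "Foo X1", "Foo Y1", "Bar X1", "Bar X1", "Bar Y1")
d <- as.dist(adist(entries))   # Levenshtein distance matrix
tree <- hclust(d, method = "average")
cutree(tree, h = 2)            # cluster membership at an assumed cutoff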
edg's answer is in the right direction, I think - you need to distinguish key words from fluff.
Context matters. To take your example, Core 2 Duo is fluff when looking at two instances of a T400, but not when looking at a CPU OEM package.
If you can mark in your database which parts of the canonical form of a product name are more important and must appear in one form or another to identify a product, you should do that. Maybe through the use of some sort of semantic markup? Can you afford to have a human mark up the database?
You can try to define equivalency classes for things like "T-400", "T400", "T 400" etc. Maybe a set of rules that say "numbers bind more strongly than letters attached to those numbers."
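One such equivalency rule might be sketched like this in R (the regex is an assumption, not a complete rule set):
# normalise model-number variants such as "T-400", "T 400", "T400"
normalize_model <- function(s) {
  s <- toupper(s)
  gsub("([A-Z])[- ]+([0-9])", "\\1\\2", s)  # bind digits to the preceding letter
}
normalize_model(c("T-400", "T 400", "t400"))
# [1] "T400" "T400" "T400"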
Breaking down into cases based on manufacturer, model number, etc. might be a good approach. I would recommend that you look at techniques for term spotting to try and accomplish that: http://www.worldcat.org/isbn/9780262100854
Designing everything in a flexible framework that's mostly rule driven, where the rules can be modified based on your needs and emerging bad patterns (read: things that break your algorithm) would be a good idea, as well. This way you'd be able to improve the system's performance based on real world data.
You might be able to make use of a trigram search for this. I must admit I've never seen the algorithm to implement an index, but have seen it working in pharmaceutical applications, where it copes very well indeed with badly misspelt drug names. You might be able to apply the same kind of logic to this problem.
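A toy version of trigram matching in R (a sketch; a real system would index the trigrams rather than compare pairwise):
# Jaccard similarity over character trigrams
trigrams <- function(s) {
  s <- tolower(gsub("\\s+", " ", s))
  unique(substring(s, 1:(nchar(s) - 2), 3:nchar(s)))
}
trigram_sim <- function(a, b) {
  ta <- trigrams(a); tb <- trigrams(b)
  length(intersect(ta, tb)) / length(union(ta, tb))
}
trigram_sim("Canon PowerShot A20IS", "NEW powershot A20 IS from Canon")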
This is a problem of record linkage. The dedupe python library provides a complete implementation, but even if you don't use python, the documentation has a good overview of how to approach this problem.
Briefly, within the standard paradigm, this task is broken into three stages:
1. Compare the fields, in this case just the name. You can use one or more comparators for this, for example an edit distance like the Levenshtein distance, or something like the cosine distance that compares the number of common words.
2. Turn an array of distance scores into a probability that a pair of records are truly about the same thing.
3. Cluster those pairwise probability scores into groups of records that likely all refer to the same thing.
You might want to create logic that ignores the letter/number combination of model numbers (since they're nigh always extremely similar).
Not having any experience with this type of problem, but I think a very naive implementation would be to tokenize the search term, and search for matches that happen to contain any of the tokens.
"Canon PowerShot A20 IS", for example, tokenizes into:
Canon
Powershot
A20
IS
which would match each of the other items you want to show up in the results. Of course, this strategy will likely produce a whole lot of false matches as well.
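A naive sketch of that token-overlap search in R (the catalogue here is hypothetical):
# return catalogue entries sharing at least one token with the search term
catalogue <- c("Canon PowerShot A20 IS", "Lenovo T400", "Lenovo R400")
token_match <- function(term, items) {
  toks <- tolower(strsplit(term, "\\s+")[[1]])
  items[sapply(items, function(i)
    any(toks %in% tolower(strsplit(i, "\\s+")[[1]])))]
}
token_match("NEW powershot A20 IS from Canon", catalogue)
# [1] "Canon PowerShot A20 IS"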
Another strategy would be to store "keywords" with each item, such as "camera", "canon", "digital camera", and searching based on items that have matching keywords. In addition, if you stored other attributes such as Maker, Brand, etc., you could search on each of these.
Spell checking algorithms come to mind.
Although I could not find a good sample implementation, I believe you can modify a basic spell-checking algorithm to come up with satisfactory results, i.e. working with words as the unit instead of characters.
The bits and pieces left in my memory:
Strip out all common words (a, an, the, new). What is "common" depends on context.
Take the first letter of each word and its length, and make that a word key.
When a suspect word comes up, look for words with the same or similar word key (see the small sketch below).
It might not solve your problems directly... but you say you were looking for ideas, right?
:-)
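A tiny sketch of that word-key idea in R (purely illustrative):
# word key: first letter (upper-cased) plus word length
word_key <- function(w) paste0(toupper(substring(w, 1, 1)), nchar(w))
word_key(c("powershot", "PowerShot", "Lenovo"))
# [1] "P9" "P9" "L6"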
That is exactly the problem I'm working on in my spare time. What I came up with is:
based on keywords, narrow down the scope of the search;
in this case you could have some hierarchy:
type --> company --> model
so that you'd match
"Digital Camera" for the type and
"Canon" for the company, and then you'd be left with a much narrower scope to search.
You could work this down even further by introducing product lines etc.
But the main point is, this probably has to be done iteratively.
We can use the Datadecision service for matching products.
It will allow you to automatically match your product data using statistical algorithms. This operation is done after defining a threshold score of confidence.
All data that cannot be automatically matched will have to be manually reviewed through a dedicated user interface.
The online service uses lookup tables to store synonyms as well as your manual matching history. This allows you to improve the data matching automation next time you import new data.
I worked on exactly the same thing in the past. What I did was use an NLP method, a TF-IDF vectorizer, to assign weights to each word. For example, in your case:
Canon PowerShot a20IS
Canon --> weight = 0.05 (not a very distinguishing word)
PowerShot --> weight = 0.37 (can be distinguishing)
a20IS --> weight = 0.96 (very distinguishing)
This will tell your model which words to care about and which to ignore. I got quite good matches thanks to TF-IDF.
But note this: a20IS cannot be recognized as a20 IS; you may consider using some kind of regex to filter such cases.
After that, you can use a numeric calculation like cosine similarity.
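A self-contained sketch of that pipeline in base R on toy data (a real system would use a proper text-mining package):
docs <- c("Canon PowerShot A20IS",
          "NEW powershot A20 IS from Canon",
          "Digital Camera Canon PS A20IS")
toks <- lapply(tolower(docs), function(d) strsplit(d, "\\s+")[[1]])
vocab <- sort(unique(unlist(toks)))
tf <- t(sapply(toks, function(tk) table(factor(tk, levels = vocab))))  # term counts
idf <- log(length(docs) / colSums(tf > 0))  # rarer words weigh more
tfidf <- tf * rep(idf, each = nrow(tf))
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine(tfidf[1, ], tfidf[2, ])  # similarity of the first two names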
