ext_pillar feature allows salt master to append dynamic data to each minion's pillar dictionary. This is achieved by salt master evaluating an eponymous python function on behalf of each minion during the pillar refresh stage.
When a large number of minions is present in the stack, the ext_pillar function would be evaluated multiple times, on behalf of each minion. In many cases this is highly undesirable, due to performance or other resource limits.
Thus, the question arises: is there a way to evaluate the ext_pillar function once per pillar_refresh command and then reuse the produced dictionary for all selected minions?
There isn't a way to do that right now. There has been an optimization done to many of the external pillar modules to cache the connection to the database, so that at least the connection is not created and torn down for each minion. But that doesn't have the optimization you're talking about regarding making one large query for all the minions at once.
Would you mind opening an issue here: https://github.com/saltstack/salt/issues/new
That will help make sure this gets discussed and considered. I see a lot of value in it.
Related
I am working on an application at the moment that is using as a caching strategy the reading and writing of data to text files in a read/write directory within the application.
My gut reaction is that this is sooooo wrong.
I am of the opinion that these values should be stored in the ASP.NET Cache or another dedicated in-memory cache such as Redis or something similar.
Can you provide any data to back up my belief that writing to and reading from text files as a form of cache on the webserver is the wrong thing to do? Or provide any data to prove me wrong and show that this is the correct thing to do?
What other options would you provide to implement this caching?
EDIT:
In one example, a complex search is performed based on a keyword. The result from this search is a list of Guids. This is then turned into a concatenated, comma-delimited string, usually less than 100,000 characters. This is then written to a file using that keyword as its name so that other requests using this keyword will not need to perform the complex search. There is an expiry - I think three days or something, but I don't think it needs to (or should) be that long
I would normally use the ASP.NET Server Cache to store this data.
I can think of four reasons:
Web servers are likely to have many concurrent requests. While you can write logic that manages file locking (mutexes, volatile objects), implementing that is a pain and requires abstraction (an interface) if you plan to be able to refactor it in the future--which you will want to do, because eventually the demand on the filesystem resource will be heavier than what can be addressed in a multithreaded context.
Speaking of which, unless you implement paging, you will be reading and writing the entire file every time you access it. That's slow. Even paging is slow compared to an in-memory operation. Compare what you think you can get out of the disks you're using with the Redis benchmarks from New Relic. Feel free to perform your own calculation based on the estimated size of the file and the number of threads waiting to write to it. You will never match an in-memory cache.
Moreover, as previously mentioned, asynchronous filesystem operations have to be managed while waiting for synchronous I/O operations to complete. Meanwhile, you will not have data consistent with the operations the web application executes unless you make the application wait. The only way I know of to fix that problem is to write to and read from a managed system that's fast enough to keep up with the requests coming in, so that the state of your cache will almost always reflect the latest changes.
Finally, since you are talking about a text file, and not a database, you will either be determining your own object notation for key-value pairs, or using some prefabricated format such as JSON or XML. Either way, it only takes one failed operation or one improperly formatted addition to render the entire text file unreadable. Then you either have the option of restoring from backup (assuming you implement version control...) and losing a ton of data, or throwing away the data and starting over. If the data isn't important to you anyway, then there's no reason to use the disk. If the point of keeping things on disk is to keep them around for posterity, you should be using a database. If having a relational database is less important than speed, you can use a NoSQL context such as MongoDB.
In short, by using the filesystem and text, you have to reinvent the wheel more times than anyone who isn't a complete masochist would enjoy.
From sqlite FAQ I've known that:
Multiple processes can have the same database open at the same time.
Multiple processes can be doing a SELECT at the same time. But only
one process can be making changes to the database at any moment in
time, however.
So, as far as I understand I can:
1) Read db from multiple threads (SELECT)
2) Read db from multiple threads (SELECT) and write from single thread (CREATE, INSERT, DELETE)
But, I read about Write-Ahead Logging that provides more concurrency as readers do not block writers and a writer does not block readers. Reading and writing can proceed concurrently.
Finally, I've got completely muddled when I found it, when specified:
Here are other reasons for getting an SQLITE_LOCKED error:
Trying to CREATE or DROP a table or index while a SELECT statement is
still pending.
Trying to write to a table while a SELECT is active on that same table.
Trying to do two SELECT on the same table at the same time in a
multithread application, if sqlite is not set to do so.
fcntl(3,F_SETLK call on DB file fails. This could be caused by an NFS locking
issue, for example. One solution for this issue, is to mv the DB away,
and copy it back so that it has a new Inode value
So, I would like to clarify for myself, when I should to avoid the locks? Can I read and write at the same time from two different threads? Thanks.
For those who are working with Android API:
Locking in SQLite is done on the file level which guarantees locking
of changes from different threads and connections. Thus multiple
threads can read the database however one can only write to it.
More on locking in SQLite can be read at SQLite documentation but we are most interested in the API provided by OS Android.
Writing with two concurrent threads can be made both from a single and from multiple database connections. Since only one thread can write to the database then there are two variants:
If you write from two threads of one connection then one thread will
await on the other to finish writing.
If you write from two threads of different connections then an error
will be – all of your data will not be written to the database and
the application will be interrupted with
SQLiteDatabaseLockedException. It becomes evident that the
application should always have only one copy of
SQLiteOpenHelper(just an open connection) otherwise
SQLiteDatabaseLockedException can occur at any moment.
Different Connections At a Single SQLiteOpenHelper
Everyone is aware that SQLiteOpenHelper has 2 methods providing access to the database getReadableDatabase() and getWritableDatabase(), to read and write data respectively. However in most cases there is one real connection. Moreover it is one and the same object:
SQLiteOpenHelper.getReadableDatabase()==SQLiteOpenHelper.getWritableDatabase()
It means that there is no difference in use of the methods the data is read from. However there is another undocumented issue which is more important – inside of the class SQLiteDatabase there are own locks – the variable mLock. Locks for writing at the level of the object SQLiteDatabase and since there is only one copy of SQLiteDatabase for read and write then data read is also blocked. It is more prominently visible when writing a large volume of data in a transaction.
Let’s consider an example of such an application that should download a large volume of data (approx. 7000 lines containing BLOB) in the background on first launch and save it to the database. If the data is saved inside the transaction then saving takes approx. 45 seconds but the user can not use the application since any of the reading queries are blocked. If the data is saved in small portions then the update process is dragging out for a rather lengthy period of time (10-15 minutes) but the user can use the application without any restrictions and inconvenience. “The double edge sword” – either fast or convenient.
Google has already fixed a part of issues related to SQLiteDatabase functionality as the following methods have been added:
beginTransactionNonExclusive() – creates a transaction in the “IMMEDIATE mode”.
yieldIfContendedSafely() – temporary seizes the transaction in order to allow completion of tasks by other threads.
isDatabaseIntegrityOk() – checks for database integrity
Please read in more details in the documentation.
However for the older versions of Android this functionality is required as well.
The Solution
First locking should be turned off and allow reading the data in any situation.
SQLiteDatabase.setLockingEnabled(false);
cancels using internal query locking – on the logic level of the java class (not related to locking in terms of SQLite)
SQLiteDatabase.execSQL(“PRAGMA read_uncommitted = true;”);
Allows reading data from cache. In fact, changes the level of isolation. This parameter should be set for each connection anew. If there are a number of connections then it influences only the connection that calls for this command.
SQLiteDatabase.execSQL(“PRAGMA synchronous=OFF”);
Change the writing method to the database – without “synchronization”. When activating this option the database can be damaged if the system unexpectedly fails or power supply is off. However according to the SQLite documentation some operations are executed 50 times faster if the option is not activated.
Unfortunately not all of PRAGMA is supported in Android e.g. “PRAGMA locking_mode = NORMAL” and “PRAGMA journal_mode = OFF” and some others are not supported. At the attempt to call PRAGMA data the application fails.
In the documentation for the method setLockingEnabled it is said that this method is recommended for using only in the case if you are sure that all the work with the database is done from a single thread. We should guarantee than at a time only one transaction is held. Also instead of the default transactions (exclusive transaction) the immediate transaction should be used. In the older versions of Android (below API 11) there is no option to create the immediate transaction thru the java wrapper however SQLite supports this functionality. To initialize a transaction in the immediate mode the following SQLite query should be executed directly to the database, – for example thru the method execSQL:
SQLiteDatabase.execSQL(“begin immediate transaction”);
Since the transaction is initialized by the direct query then it should be finished the same way:
SQLiteDatabase.execSQL(“commit transaction”);
Then TransactionManager is the only thing left to be implemented which will initiate and finish transactions of the required type. The purpose of TransactionManager – is to guarantee that all of the queries for changes (insert, update, delete, DDL queries) originate from the same thread.
Hope this helps the future visitors!!!
Not specific to SQLite:
1) Write your code to gracefully handle the situation where you get a locking conflict at the application level; even if you wrote your code so that this is 'impossible'. Use transactional re-tries (ie: SQLITE_LOCKED could be one of many codes that you interpret as "try again" or "wait and try again"), and coordinate this with application-level code. If you think about it, getting a SQLITE_LOCKED is better than simply having the attempt hang because it's locked - because you can go do something else.
2) Acquire locks. But you have to be careful if you need to acquire more than one. For each transaction at the application level, acquire all of the resources (locks) you will need in a consistent (ie: alphabetical?) order to prevent deadlocks when locks get acquired in the database. Sometimes you can ignore this if the database will reliably and quickly detect the deadlocks and throw exceptions; in other systems it may just hang without detecting the deadlock - making it absolutely necessary to take the effort to acquire the locks correctly.
Besides the facts of life with locking, you should try to design the data and in-memory structures with concurrent merging and rolling back planned in from the beginning. If you can design data such that the outcome of a data race gives a good result for all orders, then you don't have to deal with locks in that case. A good example is to increment a counter without knowing its current value, rather than reading the value and submitting a new value to update. It's similar for appending to a set (ie: adding a row, such that it doesn't matter which order the row inserts happened).
A good system is supposed to transactionally move from one valid state to the next, and you can think of exceptions (even in in-memory code) as aborting an attempt to move to the next state; with the option to ignore or retry.
You're fine with multithreading. The page you link lists what you cannot do while you're looping on the results of your SELECT (i.e. your select is active/pending) in the same thread.
I learned that it is necessary to use same key in both two processes to communicate using shared memory. In sample code I've seen , the key is hard coded in both programs(sender,receiver). My doubt is in real time how two unexpected processes use the same key.
I've read about ftok() function, but it asks for file path as argument. But how it is possible in real time as below scenario
suppose when user give print to file command from firefox some other program like ghostscript is going to make a ps/pdf file(assuming it uses shared memory). Here how firefox and ghostscript will use shared memory
Two processes unknown to each other would need to use a defined (and shared) protocol in order to use shared memory together. And that protocol would need to include the information about how to get to the shared memory (e.g., an integer value for a shmget call). Basically, it would need to define a "hard coded" identifier or some method for discovering it.
Without some kind of protocol defining this information (including what is in the memory), it would not be possible for one process to even deduce what was in a memory location that was set up by another process.
Let's presuppose that you have R running with root/admin privileges. What R calls do you consider harmful, apart from system() and file.*()?
This is a platform-specific question, I'm running Linux, so I'm interested in Linux-specific security leaks. I will understand if you block discussions about R, since this post can easily emerge into "How to mess the system up with R?"
Do not run R with root privs. There is no effective way to secure R in this way, since the language includes eval and reflection, which means I can construct invocations to system even if you don't want me to.
Far better is to run R in a way that cannot affect the system or user data, no matter what it tries to do.
Anything that calls external code could also be making system changes, so you would need to block certain packages and things like .Call(), .C(), .jcall(), etc.
Suffice it to say that it will end up being a virtually impossible task, and you are better off running it in a virtualized environment, etc. if you need root access.
You can't. You should just change the question: "How do I run user-supplied R code so as not to harm the user or other users of the system?" That's actually a very interesting question and one that can be solved with a little bit of cloud computing, apparmor, chroot magic, etc.
There are tons of commands you could use to harm the system. A handful of examples: Sys.chmod, Sys.umask, unlink, any command that allows you to read/write to a connection (there are many), .Internal, .External, etc.
And if you blocked users from those commands, there's nothing stopping them from implementing something in a package that you wouldn't know to block.
As noted by just about every response to this thread, removing the "potentially harmful" calls in the R language would:
Be potentially impossible to do completely.
Be difficult to do without spending significant time writing complicated (i.e. ugly) hacks.
Kneecap the language by removing a ton of functionality that makes R so flexible.
A safer solution that doesn't require modifying/rewriting large parts of the R language would be to run R inside a jail using something like BSD Jails, Jailkit or Solaris Zones.
Many of these solutions allow the jailed process to exercise root-like privileges but restrict the areas of the computer that the process can operate on.
A disposable virtual machine is another option. If a privileged user thrashes the virtual environment, just delete it and boot another copy.
One of my all time favorites. You don't even have to be r00t.
library(multicore);
forkbomb <- function(){
repeat{
parallel(forkbomb());
}
}
forkbomb();
To adapt a cliche from gun rights people, "system() isn't harmful - people who call system() are harmful".
No function calls are intrinsically harmful, but if you allow people to use them freely then those people may cause harm.
Also, the definition of harm will depend on what you consider harmful.
In general, R is so complex that you can assume that there is a way to trick it in executing data with seemingly harmless functions, for instance through buffer overflow.
I'm hoping to take advantage of Amazon spot instances which come at a lower cost but can terminate anytime. I want to set it up such that I can send myself data mid-way through a script so I can pick up from there in the future.
How would I email myself a .rdata file?
difficulty: The ideal solution will not involve RCurl since I am unable to install that package on my machine instance.
The same way you would on the command-line -- I like the mpack binary for that which you find in Debian and Ubuntu.
So save data to a file /tmp/foo.RData (or generate a temporary name) and then
system("mpack -s Data /tmp/foo.RData you#some.where.com")
in R. That assumes the EC2 instance has mail setup, of course.
Edit Per request for a windoze alternative: blat has been recommended by other for this task.
There is a good article on this in R News from 2007. Amongst other things, the author describes some tactics for catching errors as they occur, and automatically sending email alerts when this happens -- helpful for long simulations.
Off topic: the article also gives tips about how the linux/unix tools screen and make can be very useful for remote monitoring and automatic error reporting. These may also be relevant in cases when you are willing to let R email you.
What you're asking is probably best solved not by email but by using an EBS volume. The volume will persist regardless of the instance (note though that I'm referring to an EBS volume as opposed to an EBS-backed instance).
In another question, I mention a bunch of options for checkpointing and related tools, if you would like to use a separate function for storing your data during the processing.