Designing an OS abstraction layer

While developing an OS abstraction layer for a multi-modular system, which approach should one take:
Create a shared library of OS services, with each module built to use it and running as an individual process,
OR
Create only a single instance of the abstraction layer, which provides memory and timer services and which alone spawns all instances of the modules.
What are the pros and cons of these approaches? Please also suggest any others if possible.

Back when it was my job to architect, manage and lead (mostly by-doing;-) such layers, I had an easy time deciding: some operating systems (VMS, then-brand-new Win-NT) had very heavy-weight process spawning (so it really needed a lot of provocation to spawn a new process!-), while at the other extreme others (such as BSD 4.3 with its then-brand-new vfork!-) positively encouraged you to spawn as many processes as you wished and had very little overhead in doing so. Therefore, I judged that leaving to the application programmers the decision of whether to spawn or not was irresponsible - we (the "Base Libraries" group) really had to provide an abstraction layer that would spawn or not depending on the underlying operating system. It worked pretty well... but also it was 15/20 years ago, on a certain class of machines (workstations) with a wide variety of OS's but reasonably uniform endowments in memory, cores (one: no multi-cores back then!-) and the like.
Were I to take the same job today, I'd first push the stakeholders (top mgmt, product marketing, or whoever) for a clear definition of the range of platforms we have to support -- what OS's, what range(s) of HW endowments. If -- as I suspect -- that's as wide in its way as it was back then, my architecture would be similar in terms of hiding the concept of process spawning from application programmers. But maybe the target range is different, e.g. "smartphones and cheap netbooks", which would make the choice iffier... though abstracting process creation away from app programmers is the SAFE choice, and you should expose that decision to them only if you're willing to take a risk on said app programmers' general skills AND on whatever uniformity you can count on now in your range of platforms remaining reasonably stable in the future!-)
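A minimal C sketch of the idea (the osal_run_module name and the OSAL_CHEAP_FORK switch are hypothetical, purely to illustrate hiding the spawn-or-thread decision behind one call in the base library):

#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

typedef void *(*module_entry)(void *arg);

/* Per-platform switch the base library would set: 1 where process creation is
 * cheap (the BSD-style OSes), 0 where it is heavyweight (the VMS/NT-style ones). */
#ifndef OSAL_CHEAP_FORK
#define OSAL_CHEAP_FORK 1
#endif

/* Run a module; the application programmer neither knows nor cares whether it
 * became a separate process or a thread. Returns 0 on success, -1 on failure. */
int osal_run_module(module_entry entry, void *arg)
{
#if OSAL_CHEAP_FORK
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {          /* child: run the module, then exit */
        entry(arg);
        _exit(0);
    }
    return 0;                /* parent continues; the module is its own process */
#else
    pthread_t tid;           /* heavyweight-process OS: use a thread instead */
    return pthread_create(&tid, NULL, entry, arg) == 0 ? 0 : -1;
#endif
}

static void *hello_module(void *arg)
{
    (void)arg;
    write(1, "module running\n", 15);
    return NULL;
}

int main(void)
{
    osal_run_module(hello_module, NULL);
    sleep(1);                /* crude: give the module time to run before exiting */
    return 0;
}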

Related

Why does process creation using `clone` result in an out-of-memory failure?

I have a process that allocates about 20GB of RAM on a 32GB machine. After some events, I'm streaming the data from the parent process to stdin of the child process. It's mandatory to keep the 20GB of data in the parent process at the point when the child is spawned.
The app is written in Rust and I'm calling Command::new("path/to/command") to create the child process.
When I spawn the child process the operating system is trapping an out-of-memory error.
strace output:
[pid 747] 16:04:41.128377 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7ff4c7f87b10) = -1 ENOMEM (Cannot allocate memory)
Why does the trap occur? The child process should not consume more than 1GB and exec() is called immediately after clone().
The Problem
When a child process is created by the Rust call, several things happen at a C/C++ level. This is a simplification, but it will help explain the dilemma.
The streams are duplicated (with dup2 or a similar call)
The parent process is forked (with the fork or clone system call)
The forked process executes the child (with a call from the execvp family)
The parent and child are now concurrent processes. The Rust call you are currently using appears to be a clone call that behaves much like a pure fork, so you're 20G x 2 - 32G = 8G short, without considering the space needed by the operating system and anything else that might be running. The clone call returns a negative value and errno is set to ENOMEM.
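In rough C terms, the sequence just described looks something like this (a simplified sketch, not the actual standard-library code); the fork step is where the parent's entire 20 GB address space has to be accounted for, even though the child execs a small program an instant later:

#include <sys/types.h>
#include <unistd.h>

/* Simplified sketch of a piped spawn at the C level. */
int spawn_child_with_stdin(const char *path, char *const argv[])
{
    int fds[2];
    if (pipe(fds) < 0)                 /* fds[0] = read end, fds[1] = write end */
        return -1;

    pid_t pid = fork();                /* address space logically duplicated here */
    if (pid < 0)
        return -1;                     /* this is where ENOMEM shows up */

    if (pid == 0) {                    /* child */
        dup2(fds[0], STDIN_FILENO);    /* the pipe's read end becomes stdin */
        close(fds[0]);
        close(fds[1]);
        execvp(path, argv);            /* replaces the child's memory image */
        _exit(127);                    /* only reached if exec failed */
    }

    close(fds[0]);                     /* parent keeps the write end */
    return fds[1];                     /* stream the 20 GB of data into this fd */
}

int main(void)
{
    char *argv[] = { "cat", NULL };
    int fd = spawn_child_with_stdin("cat", argv);
    if (fd >= 0) {
        write(fd, "hello from the parent\n", 22);
        close(fd);                     /* child sees EOF and exits */
    }
    return 0;
}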
If the architectural solutions of adding physical memory, compressing the data, or streaming it through a process that does not require the entirety of it to be in memory at any one time are not options, then the classic solution is reasonably simple.
Recommendation
Design the parent process to be lean. Then spawn two worker children, one that handles your 20 GB need and the other that handles your 1 GB need [1]. These children can be connected to one another via pipe, file, shared memory, socket, semaphore, signalling, and/or other communication mechanism(s), just as a parent and child can be.
Many mature software packages from Apache httpd to embedded cell tower routing daemons use this design pattern. It is reliable, maintainable, extensible, and portable.
The 32G would then likely suffice for the 20G and 1G processing needs, along with OS and lean parent process.
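A bare-bones C sketch of that recommended shape (the worker program names are placeholders standing in for the 20 GB and 1 GB halves of the original process):

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Lean parent: it allocates (almost) nothing, so each fork is cheap, and the
 * two workers talk to each other directly over a pipe. */
int main(void)
{
    int link[2];
    if (pipe(link) < 0) { perror("pipe"); return 1; }

    pid_t big = fork();
    if (big == 0) {                       /* worker 1: produces the data */
        dup2(link[1], STDOUT_FILENO);     /* its stdout feeds the pipe   */
        close(link[0]); close(link[1]);
        execlp("./big_worker", "big_worker", (char *)NULL);
        _exit(127);
    }

    pid_t small = fork();
    if (small == 0) {                     /* worker 2: consumes the data */
        dup2(link[0], STDIN_FILENO);      /* its stdin reads the pipe    */
        close(link[0]); close(link[1]);
        execlp("./small_worker", "small_worker", (char *)NULL);
        _exit(127);
    }

    close(link[0]); close(link[1]);       /* parent holds no buffers at all */
    while (wait(NULL) > 0)                /* just supervise the children    */
        ;
    return 0;
}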
Although this solution will surely solve your problem, if the code is to be reused or extended later, there may be value in looking into potential process design changes involving data frames or multidimensional slices to support streaming of data and memory requirement reductions.
Memory Overcommit Always
Setting overcommit_memory to 1 eliminates the clone error condition referenced in the question, because the Rust call invokes the Linux clone call, which reads that setting. But there are several caveats with this solution that point back to the above recommendation as superior, primarily that the value of 1 is dangerous, especially in production environments.
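If you do experiment with that route anyway, it is worth at least checking the policy at run time rather than assuming it; on Linux the mode lives in /proc/sys/vm/overcommit_memory (0 = heuristic, 1 = always, 2 = never). A minimal C check might look like this:

#include <stdio.h>

/* Read the Linux overcommit policy: 0 = heuristic, 1 = always, 2 = never.
 * Returns the mode, or -1 if the file cannot be read (e.g. not Linux). */
int overcommit_mode(void)
{
    FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
    int mode = -1;
    if (f) {
        if (fscanf(f, "%d", &mode) != 1)
            mode = -1;
        fclose(f);
    }
    return mode;
}

int main(void)
{
    printf("vm.overcommit_memory = %d\n", overcommit_mode());
    return 0;
}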
Background
Kernel discussions about OpenBSD rfork and the clone call ensued in the late 1990s and early 2000s. The features stemming from those discussions permit forking that is less extreme than creating a full process, which is symmetric to providing more extensive independence between pthreads. Some of those discussions produced extensions to traditional process spawning that have entered POSIX standardization.
In the early 2000s, Linus Torvalds suggested a flag structure to determine which components of the execution model are shared and which are copied when execution forks, blurring the distinction between processes and threads. From this, the clone call emerged.
Over-committing memory is not discussed much, if at all, in those threads. The design goal was MORE control over the results of a fork rather than the delegation of memory usage optimization to an operating system heuristic, which is what the default setting of overcommit_memory = 0 does.
Caveats
Memory overcommit goes beyond these extensions, adding the complexity of trade-offs of its modes [2], design trend caveats [3], practical run time limitations [4], and performance impacts [5].
Portability and Longevity
Additionally, without standardization, the code using memory overcommit may not be portable, and the question of longevity is pertinent, especially when a setting controls the behavior of a function. There is no guarantee of backward compatibility or even a warning of deprecation if the setting system changes.
Danger
The linuxdevcenter documentation [2] says, "1 always overcommits. Perhaps you now realize the danger of this mode.", and there are other indications of danger with ALWAYS overcommitting [6], [7].
The implementers of overcommit on Linux, Windows, and VMware may guarantee reliability, but it is a statistical game that, combined with the many other complexities of process control, may lead to unstable characteristics under certain conditions. Even the name overcommit tells us something about its true character as a practice.
A non-default overcommit_memory mode, for which several warnings have been issued, may work for the immediate trial of the immediate case but later lead to intermittent reliability problems.
Predictability and Its Impact on System Reliability and Response Time Consistency
The idea of a process in a UNIX-like operating system, from its Bell Labs beginnings, is that a process makes a concrete request to its container, the operating system. The result is both predictable and binary: either the request is denied or it is granted. Once granted, the process is given complete control of and direct access to the resource until the process relinquishes it.
The swap space aspect of virtual memory is a breach of this principle that appears as gross deceleration of activity on workstations when RAM is heavily consumed. For instance, there are times during development when one presses a key and has to wait ten seconds to see the character on the display.
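A small C experiment makes the contrast concrete: under an always-overcommit policy the allocation call itself reports success, and the denial arrives only later, when the pages are actually touched, often via the OOM killer rather than as an error code the program can handle. (The 64 GiB figure below is arbitrary; don't run this on a machine you care about.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* With vm.overcommit_memory = 1 the malloc below typically succeeds even for
 * absurd sizes; the process only dies later, while memset touches the pages.
 * With a strict policy the failure is immediate, explicit, and handleable. */
int main(void)
{
    size_t huge = (size_t)64 * 1024 * 1024 * 1024;   /* 64 GiB */
    char *p = malloc(huge);
    if (p == NULL) {
        puts("denied up front: predictable, binary result");
        return 1;
    }
    puts("granted... but was it really?");
    memset(p, 1, huge);      /* the real commitment happens page by page here */
    puts("survived touching every page");
    free(p);
    return 0;
}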
Conclusion
There are many ways to get the most out of physical memory, but doing so by hoping that use of allocated memory will be sparse will likely introduce negative impacts. Performance hits from swapping when overcommit is overused are the well-documented example. If you are keeping 20G of data in RAM, this may particularly be the case.
Only allocating what is needed, forking in intelligent ways, using threads, and freeing memory that is surely no longer needed lead to memory thrift without impacting reliability or creating spikes in swap disk usage, and can operate without caveat up to the limits of system resources.
The position of the designer of the Command::new call may be based on this perspective. In this case, how soon after the fork the exec is called is not a determining factor in how much memory is requested during the spawn.
Notes and References
[1] Spawning worker children may require some code refactoring and appear to be too much trouble on a superficial level, but the refactoring may be surprisingly straightforward and significantly beneficial.
[2] http://www.linuxdevcenter.com/pub/a/linux/2006/11/30/linux-out-of-memory.html?page=2
[3] https://www.etalabs.net/overcommit.html
[4] http://www.gabesvirtualworld.com/memory-overcommit-in-production-yes-yes-yes/
[5] https://labs.vmware.com/vmtj/memory-overcommitment-in-the-esx-server
[6] https://github.com/kubernetes/kubernetes/issues/14452
[7] http://linuxtoolkit.blogspot.com/2011_08_01_archive.html

Cluster vs Grid vs Cloud

There are two questions:
1) What is the difference between a Cluster and a Grid?
2) What is the Cloud?
I am not looking for conceptual definitions;
I found a lot of those by googling, but the problem is I still do not get it,
so I believe the answer I seek is different. From what I could research online, I am starting to think that
many article writers who try to explain this either do not understand it deeply enough themselves
or are not able to explain their knowledge to an average guy like myself (which is a common issue with very technical people).
Just to let you know my level: I am a computer programmer (.NET and LAMP), I can do basic admin on both
Linux flavors and Windows, I have hands-on experience with Hyper-V, and I am now researching Xen and XCP
to set up a test cloud based on two computers for learning purposes.
You do not have to read the info below; it is just my current understanding of cluster, grid and cloud,
included to support my two questions, because I thought it would help to show
what kind of mess is in my head right now and what answers I am looking for.
Thank you.
Two computers used for reference in my statements are "A" and "B"
specs for "A": 2-core Intel CPU, 8 GB memory, 500 GB disk
specs for "B": 2-core Intel CPU, 8 GB memory, 500 GB disk
Now I would like to look at the roles of A and B from the Cluster, Grid and Cloud angles.
Common definitions for Cluster and Grid
1) A Cluster or a Grid is 2 or more computers hooked up together; on the hardware level
they are hooked up through network cards, and on the software level
some kind of program implementing a message passing interface is used
to make it possible to send commands between nodes.
2) A Cluster or a Grid does NOT combine CPU power or memory between nodes, meaning
that in this scenario a Firefox browser running on A still has only one 2-core CPU,
8 GB memory and 500 GB available.
Differences between Cluster and Grid:
1) A Cluster only provides the failover part: if node A breaks while Firefox is running,
the cluster software will restart the Firefox process on node B.
2) A Grid, however, is able to run software in parallel on multiple nodes at the same time,
provided that software is coded with MPI in mind. It can also launch any software on any node
on demand (even if it is not written for MPI).
3) A Grid is also able to combine different types of
nodes, such as a Linux server, Windows XP, an Xbox and a PlayStation, into one Grid.
Cloud definition:
1) Cloud is not a technical term at all; it is just a short, convenient word to describe
a computer of unlimited resources. It could also be called a Supercomputer, a Beast, an Ocean or a Universe, but someone
said "Cloud" first and here we are.
2) Cloud can be based on Grids or on Clusters
3) From a technical point of view, Cloud is software that combines hardware resources into one,
meaning that if I install Cloud software on a Grid or Cluster then it will combine A and B
and I will get one Cloud like this: 4-core CPU, 16 GB memory and 1000 GB disk.
edited: 2013.04.02
item 3) was complete nonsense; the cloud will NOT combine resources from many nodes into one huge resource, so in this case there will be no 4-core CPU, 16 GB memory and 1000 GB cloud.
Grid computing is designed to parcel out large workloads to many participating grid members--through software on each member which is expecting to hear that request for computation or for data, and to reply with its small piece of the overall puzzle. Applications must be written specifically for this approach to problem-solving. It can be heterogeneous because it's not the OS that matters but the software waiting to hear problem-solving requests.
The expectation of a cluster is that it can run the same executable image across any member node--any node can execute that code--which is what drives its requirement for homogeneity. You can write cluster-aware code which distributes workload throughout the cluster, but again you have to write your code to be cluster aware in order to take advantage of more than the redundancy features of a cluster. As most application vendors do not write cluster-aware code, the simple redundancy feature is all that's commonly used in cluster deployments, but that does not limit the architecture. Clusters can and do share their resources, and can collaborate on tasks simultaneously.
Cloud, as it's commonly defined, is neither of these, precisely, but it doesn't preclude them, either. Cloud computing assumes the ability to deploy an application without advance knowledge of its underlying operating system, or even control of that operating system, coupled with the ability to expand or reduce the processing and memory footprint available to that application without having to destroy and recreate that environment--all done with enough isolation that the application won't know or be able to know what other applications might be installed or running on its shared infrastructure, unless that access is approved of by both application managers.
I would like to answer my own question before this is closed as a duplicate, because I believe it can be very frustrating to find correct info regarding clusters, grids and clouds, and I think this post can save time for many. If someone wants to challenge it please do so; otherwise I will mark it as the answer in 1 week.
1) There are many differences and there are none; it really depends on the technical context, but
generally you can connect several nodes and call it a Grid, or you can call it a Cluster. I would say a Grid is a Cluster with extended capabilities, such as the ability to connect heterogeneous nodes. Both Grid and Cluster will serve equally well as a scale-out platform. From a network engineer's and programmer's perspective, the difference in implementation or coding will be pretty big if the Grid connects heterogeneous nodes.
2) Now, the first question was actually a prelude to the second one, and I believe it is best answered by
Matt Joyce in this post:
https://stackoverflow.com/a/15286488/2230126
I'll take a crack at it. I have been collecting and saving my notes, scripts, and programs since the year 2002 A.D. This is a chop and paste of my statements over the years. Here is a brain friendly memorization list:
The grid is the hardware and hardware specifications.
a. You plug into the router or switch and set up IP addresses and top-level domains over the internet (which is also known as ICANN).
b. This is like OSI level 1, 2, and 3.
The cluster is the kernel (software ring 0, or 1 if it's a virtualized setup).
a. The kernel is configured (compiled) to run a network stack that can handle sessions, permissions, and account authentication.
b. You set up port to port communications usually over TCP/IP (like in the OSI model).
c. You set up iptables, pf, arp, and other OS-level applications or shared objects.
d. You can set up ssh, kerberos, ldap, or some other PKI-database and protocol-socket combo.
e. This is like OSI level 4, 5, and 6.
The cloud is user-space applications.
a. The application processes talk to other application-processes within the cluster.
b. You set up process-level permissions (via files, cgroups, and/or user-groups).
c. You set up mysql, redis, riak, message brokers, hadoop, apache, nginx, cron, java, haskell, erlang, et cetera.
d. This is like OSI level 7.
The cloud floats over the cluster that grows from the grid. And actually, think of it visually: cloud in the air, cluster in the tree, and grid on the ground. Most of us creative types (who make all these technologies) are visual thinkers who can back it up with mathematical data and code. So always see if you can answer the riddle and correlate technological facsimiles to our physical realm here on Earth.
Intro
Grid, Cluster, and Cloud are three different words that each mark a specific time in history. Their definitions have intersecting traits and nowadays they are often used interchangeably; you just need to know when to apply the correct or associated word. For example, I was talking to some older M.D.s (medical doctors) and they wanted to know what the cloud was. So I told them that the cloud was a computer cluster that you rent over the internet. And bingo, they got the idea within 10 seconds.
I will use a little bit of history in chronological prose.
Grid
The term grid was first used to represent one resource that is repeated across the terrestrial landscape or space. The term was frequently used during the spread of the telegraph, where repeaters had to be placed on poles every N radii (plural of radius) to amplify the signal. Another example is the electrical grid that Thomas Edison and Nikola Tesla competitively started spreading around the Earth. Computers got really popular and were soon spread across The Grid to replace human telegraph (and telephone) operators.
The Grid is now a bunch of computers that can connect and terminate communication channels. The Grid is an infrastructure of computers that function for one goal, which is to run assembly (or binary) code.
Cluster
Foreseeing the power of computers and actually witnessing computers win wars (Turing's machine), DARPA (or ARPA, which is the U.S. military) stepped in.
DARPA started commissioning universities and colleges to utilize the Grid for multiplexing communication methods (that use baud rates and protocols). Universities and colleges started making protocols to separate the different tasks that they wanted to carry out over the Grid and to target the computers. That started the modern internet. In-house testing clusters were established in laboratories to simulate the grid. Clusters are great for orchestration. A job can be sub-divided over all or some of the slaves within a cluster. The military utilized the colleges' and universities' findings and applied the SOFTWARE to the Grid. There were some gotchas with clusters:
Must be same (or near same) hardware
Must have same operating system
The rules were strict because all the instruction sets had to be the same passing over the CPUs. Clusters usually had a master/slave type relationship. A Cluster usually ran one unique (or unix) job at a time. Clusters had job schedulers. Then clusters got more complex because hardware manufacturers started making parallel chip architectures (on top of the Von Neumann arch).
Clusters became more powerful. The Clusters inherited more complexity and people were doing more creative things. A cluster could now do different jobs, tasks, processes, asynchronous processes, synchronized processes, and many more interesting things. One box (or computer node) could run more jobs. Now the Grid could be used for multiple purposes. The rate of software updates on clusters was faster than on the actual grid. Clusters were deployed locally on campuses. Clusters started superseding the grid because you could directly produce a public-facing stack that out-performed the (national) grid.
My Experience
I went to college during the late 1990s and 2000s, and cluster was the word for a physical laboratory of multiple computers working as one virtual computer. Clusters were used for testing. Once your software worked on the cluster, then you could mv (move) it to the production-grade Grid. Then I witnessed network worms and computer viruses control zombie computers. These swarms of zombies could be used as one gigantic virtual cluster to run commands. Well, programmers started DIY (do it yourself) protocols and software like BitTorrent and Napster.
So leaping forward into the future, testing cluster software is starting to be replaced by Solaris jails, FreeBSD jails, Linux containers, QEMU, hypervisors, VMWare, VirtualBox, Vagrant, and Docker.
Cloud
Cloud is a marketing term used to umbrella the hardware of different grids and the software of those clusters. Cloud is one big ubiquitous word used to advertise, promote, and profess all that cluster technology for monetary gains. Cloud is also an effort to wrap all those technologies under one singular word. The Cloud allows multi-tenanted processes to share a gigantic grid. The Cloud maximizes efficiency by sub-dividing the electricity, CPU, RAM, disk, and broadband, which get shared and paid for by consumers. A side effect is that those consumer subscriptions and/or pay-rates started producing profit. The Cloud also allows multiple users to install multiple operating systems that run multiple processes all in software. So now we have acronyms like IaaS, PaaS, and SaaS. The Cloud can replace the start-up cost that was once so darn difficult to fund and bootstrap. The Cloud is a great solution for mock testing your software and building a consumer base for your business.
From another perspective, the Cloud triggers the brain of non-programmers to think a certain way. For example, the human resources department can comprehend and isolate what is presented in front of them.
So if you got the money, then you can purchase your share of the cloud experience and have easy support along with it. But if you have the skill-set, the time, the quick know-how, and the ability to install your own servers at co-locations, then do that because it is cheaper over the long run.
That is my narrative on the Grid vs Cluster vs Cloud.
I think this link compares Cluster and Grid well.
As far as I know, there are some exceptions in the case of Clusters. YARN (Yahoo!) tries to handle multi-tenancy and distributed scheduling. Also, Corona (Facebook) has distributed scheduling.

Is OCaml suitable to write networking servers?

I was wondering whether OCaml will perform well in terms of performance and ease of implementation while dealing with typical client/server interactions over TCP in a multi-threaded environment. I mean something really typical, like having a thread per client that receives data, applies changes to the game state and sends them back to clients.
This is because I need to write a server for a game, and I have always done these things in C, but since I now know OCaml I was curious whether it would be OK, or whether I'd just find myself trying to solve a typical problem in a language that doesn't fit it well.
Performance: probably not. OCaml's threads do not provide parallel execution, they are only a way to structure your program. The OCaml runtime itself is not thread-safe, so the only code that could possibly execute in parallel of a single OCaml thread would be interfaced C code (without callbacks to OCaml!).
Implementation-wise, there is a mutex on the run-time, which is released when calling blocking C primitives, and could also be released when calling C functions that do significant work.
Ease of implementation: it wouldn't be world-changing. You would have the comfort of OCaml and a pthread-like library on the side. If you are looking for new things to discover while leveraging what you have learnt of OCaml, I recommend Jocaml. It goes in and out of sync with OCaml, but there was a (re-)re-implementation quite recently, and even when it is slightly out of sync, it is a lot of fun, and a completely new perspective of concurrent programs.
Jocaml is implemented on top of OCaml. What with the run-time not being concurrent and all, I am almost sure it uses separate processes and message-passing. But for the application that you mentioned it should be able to do fine.
OCaml is quite suitable for writing network servers, although as Pascal observes, there are limitations on threading.
Fortunately, however, threading isn't the only way to organize such a program. The Lwt library (for Light Weight Threads) provides an abstraction of asynchronous I/O that is quite easy to use (particularly when combined with a bit of syntax support). Everything actually runs in one thread, but it's all driven by an asynchronous I/O loop (built on the Unix select call), and the programming style lets you write code that looks like direct code (avoiding much of the normal code overhead of doing asynchronous I/O in many other languages). For example:
lwt my_message = read_message socket in
let response = compute_response my_message in
send_response socket response
Both the read and the write happen back in the main event loop, but you avoid the normal "read, calling this function when you're done" manual overhead.
I'm so sorry this question has been sitting here for eight years with what I consider to be several quite bad answers because they all ignore the elephant in the room.
You say "really typical like having a thread per client" but having an OS thread per client is an extremely bad design. Threads are heavyweight, taking a long time to create and destroy and consuming ~1MB just for the thread stack. If you have one thread per connection then 1,000 simultaneous client connections (which is entirely feasible) will burn 1GB of RAM just for their stacks and the performance of your program (in any language) will be cripppled by the amount of context switching required to get any work done. You don't want to use that design in any language including both C and OCaml. Note that this problem is especially bad in the context of tracing garbage collected languages because the GC also traverses all of those thread stack in order to collate global roots before every GC cycle. I am the first to admit that this anti-pattern is ubiquitous in the real world but please don't copy it! I have seen "low latency" servers in the finance industry written in C++ using one thread per connection and they suffered latency stalls of up to six seconds just from the (Windows) OS servicing those threads.
See: http://people.eecs.berkeley.edu/~sangjin/2012/12/21/epoll-vs-kqueue.html
Let's consider an efficient design instead, like an epoll or kqueue interface to the OS kernel giving the server's code information about incoming and outgoing data buffers. Single threaded servers can attain excellent performance with this design. However, a typical server has serialization work to do per client and some core work that is often performed in serial across all client connections. Therefore, serialization and deserialization can be parallelized but the core server operation cannot. In this context, OCaml is great for everything except the serialization layer because it has poor support for parallelism.
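For concreteness, here is the skeleton of such an event loop in C on Linux; it is a minimal sketch with no error handling and a plain echo standing in for real game logic (kqueue plays the same role on the BSDs):

#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9000);
    bind(lfd, (struct sockaddr *)&addr, sizeof addr);
    listen(lfd, 128);

    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = lfd };
    epoll_ctl(ep, EPOLL_CTL_ADD, lfd, &ev);

    for (;;) {
        struct epoll_event events[64];
        int n = epoll_wait(ep, events, 64, -1);   /* block until sockets are ready */
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == lfd) {                      /* new client connection */
                int cfd = accept(lfd, NULL, NULL);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = cfd };
                epoll_ctl(ep, EPOLL_CTL_ADD, cfd, &cev);
            } else {                              /* data from an existing client */
                char buf[4096];
                ssize_t r = read(fd, buf, sizeof buf);
                if (r <= 0) { close(fd); continue; }  /* closing removes it from epoll */
                write(fd, buf, (size_t)r);        /* echo back; a real server parses and serializes here */
            }
        }
    }
}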
I have personally implemented many servers for various industries with hugely varying performance requirements. In my experience, OCaml is one of the best tools for this because it offers excellent libraries (easy to use and reliable) and excellent serial performance. The only issue I have is around parallelizing the serialization layer but, in practice, I have found that OCaml runs circles around alternatives like Java and .NET even though they can parallelize this. I found typical latencies were ~100us for .NET and 10us for OCaml.
See also: http://prl.ccs.neu.edu/blog/2016/05/24/measuring-gc-latencies-in-haskell-ocaml-racket/
OCaml will work great for networking applications as long as you can live with a relatively small number of threads active at one time—say no more than 100. You could consider MLdonkey as an example, although in the client space, not in the server space.
Haskell would be a better choice if you want to use many preemptive threads. GHC can support huge numbers of threads and they run in parallel on multicore systems. OCaml prefers cooperative multithreading and multiple processes.

MPI overhead in shared memory setup

I want to parallelize a program. It's not that difficult with threads working on one big data structure in shared memory.
But I want to be able to distribute it over a cluster, and I have to choose a technology to do that. MPI is one idea.
The question is: what overhead will MPI (or another technology) have if I skip implementing a specialized version for shared memory and let MPI handle all cases?
Update:
I want to grow a large data structure (game tree) simultaneously on many computers.
Most parts of it will live on only one cluster node, but some of it (the irregular top of the tree) will be shared and synchronized from time to time.
On shared memory machine I would like to have this achieved through shared memory.
Can this be done generically?
All the popular MPI implementations will communicate locally via shared memory. The performance is very good as long as you don't spend all your time packing and unpacking buffers (i.e. your design is reasonable). In fact, the design imposed upon you by MPI can perform better than most threaded implementations because the separate address space improves cache coherence. To consistently beat MPI, the threaded implementations have to be aware of the cache hierarchy and what the other cores are working on.
With good network hardware (like InfiniBand) the HCA is responsible for getting your buffers on and off the network so the CPU can do other things. Also, since many jobs are memory bandwidth limited, they will perform better using, e.g. 1 core on each socket across multiple nodes than when using multiple cores per socket.
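As a reference point, the sketch below is essentially the whole MPI programming model for this case: whether the message crosses a shared-memory segment (both ranks on one node) or the interconnect (ranks on different nodes) is decided by the MPI runtime at launch time, not in the source. These are standard MPI calls; compile with mpicc and run with mpirun -np 2.

#include <mpi.h>
#include <stdio.h>

/* Rank 0 sends a buffer to rank 1. Launched on one node the transfer goes
 * through shared memory; launched across two nodes the same code goes over
 * the network, with no change to the program. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[1024] = {0};
    if (rank == 0) {
        MPI_Send(buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d doubles\n", 1024);
    }

    MPI_Finalize();
    return 0;
}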
It depends on the algorithm. Clearly inter-cluster communication is orders of magnitude slower than shared memory, either as inter-process communication or as multiple threads within a process. Therefore you want to minimize inter-cluster traffic, e.g. by duplicating data where possible and practicable, or by breaking the problem down in a way that minimizes inter-node communication.
For 'embarrassingly' parallel algorithms with little inter-node communication it's an easy choice - these are problems like brute-force searching for an encryption key, where each node can crunch numbers for long periods and report back to a central node periodically, but no communication is required to test keys.

Microcontroller + Verilog/VHDL simulator?

Over the years I've worked on a number of microcontroller-based projects; mostly with Microchip's PICs. I've used various microcontroller simulators, and while they can be very helpful at times, I often find myself frustrated. In real life microcontrollers never exist alone and the firmware's behavior is dependent on the environment. However, none of the sims I've used provide decent support for anything outside the microcontroller.
My first thought was to model the entire board in Verilog. But, I'd rather not create an entire CPU model, and I haven't had much luck finding existing models for the chips I use. Regardless, I really don't need, or want, to simulate the proc at that level of detail, and I'd like to retain the debugging facilities provided by a regular processor sim.
It seems to me that the ideal solution would be a hybrid simulator that interfaces a traditional processor simulator with a Verilog model.
Does such a thing exist?
I've used the Altera Nios II processor embedded on an FPGA. Altera provides a toolchain for simulating the CPU (with its software) together with your custom logic in a simulator. I suppose that a similar setup can be achieved by downloading a VHDL/Verilog core of your CPU (did you try OpenCores? They have lots of stuff there).
But keep in mind that it is going to be mind-bogglingly slow, so don't expect to simulate whole complex processes this way. The best you can hope for is simulating fine software-hardware interaction points to debug problems. If you need a deeper simulation, consider running it on a FPGA with built-in monitoring code.
For the "simulate the whole board" approach,
The Free Model Foundry has a large number of models, some in VHDL others in Verilog, that are available now.. but you'll need to pay to have new models created. These are very helpful in being sure the board is built correctly.
But I think the more common approach when debugging your PIC is to just build a board, then work on the firmware. In the chip world, (where the firmware is running on a microprocessor in a chip that hasn't gone to fab yet) people often resort to very expensive systems (or renting time on them) that allow compiling part of the design into an emulator while the rest of the design runs in the normal simulator environment. Without the barrier of an expensive mask set for the chip, the cost is just not justifiable for a Circuit board. Although I've heard of some creative applications of Simulink (Mathworks) with FPGA's, but my recollection is that one either ran the system on the computer, or programmed the device and ran the same thing in realtime.
I believe both Cadence (ask about Palladium) and Mentor Graphics have that integrated solution if you have the money to spend on it.
What I have done recently is create an interface between the simulation environment and the host system. Different HDL simulators have different interfaces, and getting the simulator NOT to think in batch mode (the traditional simulation model) but instead to run forever like a real design is half of the problem.
Then from the host using C (or whatever) you can create abstractions that may or may not allow you to write your application software for whatever target (depending on what language and compiler capabilities you have). For example you can make a generic poke and peek function and on the final target have those actually poke and peek memory or I/O, but for simulation through the abstraction you talk to a testbench in the simulation that simulates the same memory cycle.
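A C sketch of that poke/peek idea (all names and the tiny wire format are invented for illustration; the point is that application code calls target_poke/target_peek identically whether it is linked against the real-hardware version or the socket-to-testbench version):

#include <stdint.h>
#include <string.h>
#include <unistd.h>

#ifdef TARGET_BUILD

/* Real target: the two functions touch memory or I/O directly. */
void target_poke(uint32_t addr, uint32_t data)
{
    *(volatile uint32_t *)(uintptr_t)addr = data;     /* real memory/IO write */
}

uint32_t target_peek(uint32_t addr)
{
    return *(volatile uint32_t *)(uintptr_t)addr;     /* real memory/IO read */
}

#else  /* simulation build: talk to the HDL testbench instead */

int sim_fd = -1;   /* socket connected to the simulation testbench, set up elsewhere */

void target_poke(uint32_t addr, uint32_t data)
{
    uint8_t msg[9] = { 'W' };                         /* invented 9-byte wire format */
    memcpy(&msg[1], &addr, 4);
    memcpy(&msg[5], &data, 4);
    write(sim_fd, msg, sizeof msg);                   /* testbench runs the write cycle */
}

uint32_t target_peek(uint32_t addr)
{
    uint8_t msg[9] = { 'R' };
    uint32_t data = 0;
    memcpy(&msg[1], &addr, 4);
    write(sim_fd, msg, sizeof msg);
    read(sim_fd, &data, 4);                           /* testbench returns the read data */
    return data;
}

#endif

Application code written against those two calls can then move between the simulator, an FPGA prototype, and the final target without change, which is exactly the property described here.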
I went one step further and used (Berkeley) sockets between the host and the test bench, so that the simulation can keep running while the host applications stop and start. Not unlike a real processor with an OS, where you start applications, run them to completion, and start another. At least for test applications; for delivery you probably only have one app.
By creating these abstraction layers I can write real applications that will be used on the target when it is built. Along the way you can use software simulation of the logic initially, then, if you like, build an FPGA with an abstraction interface (throw-away logic), say a UART for example. Replace the shim between the application's abstraction layer and the simulator with a UART interface, or whatever. Then, when you marry the processor and logic in the same chip or on the same board, replace the abstraction layer again with direct calls to whatever interfaces they always thought they were talking to. If something breaks and you have retained the abstraction layer, you can take the application back to the simulation model and have access to all of your logic internals.
Specifically, this time around I am using an HDL language, Cyclicity CDL, which is on SourceForge; the documentation needs some help, but the examples may get you going, and it produces synthesizable Verilog, so you get an extra win there. I threw out all the scripting batch stuff other than the bare minimum needed to connect and start a C simulation model. So my test bench is in C (well, C++ technically); the sockets layer was done there. The output can be .vcd files, which gtkwave uses. Basically you can do the bulk of your HDL design using open-source software with no licenses, etc. By adding one or two lines of code to the CDL simulation part I was able to have it run as an infinite loop, which I can say works quite well; there don't appear to be any memory leaks, etc.
Both ModelSim and Cadence have standardized ways of connecting host C programs to the simulation world, and from there you can use IPC to get host applications talking to an abstraction layer API.
This is probably way overkill for a PIC; I gave up on PICs a while ago for the faster and more C-friendly ARM-based micros anyway. There is/was an open-core PIC that you could simply incorporate into your simulation, even though that is not what you are trying to do here.
Not that I've seen. Your best bet is to properly define the interfaces and behavior between the uC and the FPGA and then define a series of test waveforms that can be applied using an automated tester. You would have to make the automated tester (or perhaps a logic analyzer may have some such functionality) out of an FPGA or uC (apply waveform, watch interrupts, breakpoints, etc.). If you really want, I know that OpenCores.org has PIC and AVR-like 8-bit uC cores defined in VHDL, so you could implement your entire project on the FPGA and then just debug that.
Generally there isn't a need to model the CPU at the RTL level, since you don't really care what it does bit by bit; you generally care about what it does, e.g. register values, memories and bus accesses.
The simplest is called a Bus Functional Model. This just generates the reads and writes that the CPU does, often based on a text file. These are available for some CPUs and many popular buses (e.g. PCI, PCIe). These simulate super fast.
The next step up is a functional cycle-accurate model. Those simulate fast. They are often encrypted.
Last is a full RTL model. Those usually are only available if you are working closely with the CPU vendor, e.g. using their core in your ASIC. Typically these are encrypted, unless you are a huge company.
Memory models are typically cycle-accurate (e.g. Micron).
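To make the bus-functional-model idea above a bit more concrete, here is a toy host-side driver in C that replays a text file of transactions; the file format and the bus_write/bus_read hooks are invented for this sketch and stand in for whatever actually drives the bus cycles in your testbench:

#include <stdio.h>

/* Replays "W <addr> <data>" / "R <addr>" lines (hex values) from a stimulus
 * file, handing each transaction to the bus hooks below. */
static void bus_write(unsigned addr, unsigned data)
{
    printf("WRITE 0x%08x <= 0x%08x\n", addr, data);   /* placeholder bus cycle */
}

static unsigned bus_read(unsigned addr)
{
    printf("READ  0x%08x\n", addr);                   /* placeholder bus cycle */
    return 0;
}

int main(int argc, char **argv)
{
    FILE *f = fopen(argc > 1 ? argv[1] : "stimulus.txt", "r");
    if (!f) { perror("stimulus file"); return 1; }

    char op;
    unsigned addr, data;
    while (fscanf(f, " %c %x", &op, &addr) == 2) {
        if (op == 'W' && fscanf(f, "%x", &data) == 1)
            bus_write(addr, data);
        else if (op == 'R')
            bus_read(addr);
    }
    fclose(f);
    return 0;
}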
My workmates from the hardware department use FPGA simulation software quite often to find timing bugs and track down strange behaviours.
Simulating one or two milliseconds can take several hours, so using the simulator for anything but very small things is not feasible.
You may want to have a look at SystemC though. http://en.wikipedia.org/wiki/SystemC

Resources