Is there a limit on data storage here: https://jupyter.org/try?
Thank you!
The Jupyter demo page that you referred to uses https://mybinder.org/ under the hood. In the FAQ section, they specify the available RAM as 1-2 GB, but they don't state a limit on storage space.
The reason for this is that the typical use case is to store all your data in a git repo, such as on GitHub, and Binder builds on that model. GitHub does not enforce a hard cap on repository size (see "Repository size limits for GitHub.com"), although it recommends keeping repositories small. In any case, the larger your repo, the longer your project takes to launch, which imposes a natural limit.
https://github.com/binder-examples/getting-data also provides good insight into the various approaches to loading data into your mybinder Docker container. Binder also restricts network speed, blocks FTP, and limits traffic to sites like GitHub, which further caps how much data you can pull into the container.
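For the common case of fetching data over HTTP when the session starts, a minimal sketch (the dataset URL and destination path below are placeholders, not anything Binder provides) might look like this in a notebook or startup script:

```python
# Minimal sketch: fetch a dataset into the running Binder container at
# startup instead of committing it to the repo. The URL below is a
# placeholder -- substitute your own publicly reachable data source.
import urllib.request
from pathlib import Path

DATA_URL = "https://example.org/datasets/sample.csv"  # placeholder
DEST = Path("data/sample.csv")

DEST.parent.mkdir(parents=True, exist_ok=True)
if not DEST.exists():  # avoid re-downloading on repeated runs
    urllib.request.urlretrieve(DATA_URL, DEST)
print(f"{DEST} is {DEST.stat().st_size} bytes")
```

Keeping the repo small and pulling data at runtime like this is also what keeps launch times reasonable.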
At the end of the day, the storage limits are based on respect. Don't abuse the platform.
So far I've only used Jupyter on my local machine, which is way too slow. I'm completely new to using cloud services for Jupyter, or to using cloud services at all for that matter. I know there are a million tutorials out there, but this is my problem: how do I choose the right service from all those options (Amazon? Google? Cheaper options?), and what's the 'right way' to get started?
What I need:
I want a service where I can start up a Jupyter notebook in my browser as simply as possible. (I know next to nothing about setting up servers etc., and have very limited time to learn that if needed)
I currently have an old MacBook from 2014. The server should be at least 10x faster. (Which options do I need to pick?)
I want to do machine learning, so GPUs would be good.
My budget is about $50 per month; less would be better, and a free trial option would be great too.
As I am completely new, I also need to know what pitfalls to look out for. (E.g.: Stop the machine to stop increasing the costs?)
If you could help me, or point me to a good tutorial or even a book, I'd be forever grateful.
(Sorry for the basic question. Of course I googled tutorials myself before posting this question, but as indicated above, I'm overwhelmed by the options - that's why I posted this question.)
AWS-based tutorial:
https://aws.amazon.com/de/getting-started/hands-on/get-started-dlami/
GPU, CPU, and pricing information is gathered here:
https://docs.aws.amazon.com/dlami/latest/devguide/pricing.html
You can set up a budget to limit costs:
https://aws.amazon.com/de/getting-started/hands-on/control-your-costs-free-tier-budgets/
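Regarding the pitfall mentioned in the question: yes, stop the machine when you are done, or you keep paying for compute. As a hedged sketch (assuming your AWS credentials are already configured; the region and instance ID below are placeholders), you can do this from Python with boto3 as well as from the console:

```python
# Sketch: stop a (hypothetical) EC2 notebook instance when you are done,
# so you only pay for its storage rather than its compute time.
# The region and instance ID are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")  # placeholder region

response = ec2.stop_instances(InstanceIds=["i-0123456789abcdef0"])  # placeholder ID
for change in response["StoppingInstances"]:
    print(change["InstanceId"], "->", change["CurrentState"]["Name"])
```

Combined with the budget alerts above, this covers the most common source of surprise bills.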
I've set up NXRM 3.14 with a Ceph (S3-compatible) blobstore back end. I've been testing it both on physical hardware and inside a Docker container.
It "works", but it is much, much slower than uploading directly to the bucket (a 2-second upload directly to the bucket can take 2 minutes through NXRM).
I haven't found any bugs or complaints about this, so I'm guessing it's specific to Ceph and that the performance may be fine with S3. Uploads to the local filesystem are also very fast.
I've found nothing in the log files to indicate performance problems.
Sorry this question is extremely vague, but does anyone have recommendations for debugging NXRM performance or maybe is anyone using a similar setup? Thanks.
I eventually tracked this down in the NXRM open-source code: the current MultipartUploader is single-threaded (https://github.com/sonatype/nexus-public/blob/master/plugins/nexus-blobstore-s3/src/main/java/org/sonatype/nexus/blobstore/s3/internal/MultipartUploader.java) and uploads chunks sequentially.
For files larger than 5 MB, this introduces a considerable slowdown in upload times.
I've submitted an improvement suggestion on their issue tracker: https://issues.sonatype.org/browse/NEXUS-19566
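To illustrate the suggestion, here is a rough Python/boto3 sketch of the concurrent approach (this is not the Nexus code, which is Java; the bucket, key, and file path are placeholders):

```python
# Sketch (not the Nexus implementation): upload S3 multipart chunks
# concurrently instead of sequentially, which is the gist of the
# improvement suggested above. Bucket, key, and path are placeholders.
# For brevity this reads all chunks up front; a production uploader
# would bound memory use.
import boto3
from concurrent.futures import ThreadPoolExecutor

BUCKET, KEY, PATH = "my-bucket", "big/artifact.bin", "artifact.bin"  # placeholders
CHUNK = 5 * 1024 * 1024  # 5 MB, the S3 minimum part size

s3 = boto3.client("s3")
upload = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)

def upload_part(args):
    part_number, data = args
    resp = s3.upload_part(Bucket=BUCKET, Key=KEY, UploadId=upload["UploadId"],
                          PartNumber=part_number, Body=data)
    return {"PartNumber": part_number, "ETag": resp["ETag"]}

def chunks(path):
    with open(path, "rb") as f:
        n = 0
        while data := f.read(CHUNK):
            n += 1
            yield n, data

with ThreadPoolExecutor(max_workers=4) as pool:
    parts = sorted(pool.map(upload_part, chunks(PATH)),
                   key=lambda p: p["PartNumber"])

s3.complete_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=upload["UploadId"],
                             MultipartUpload={"Parts": parts})
```

With parts going up in parallel, the per-part round-trip latency (which seems to dominate against Ceph) stops adding up linearly.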
What is the meaning of this number? How do I know how many disks I need for a developer box?
How do I know how many disks I need for a developer box?
I was searching for an official answer to this a couple of weeks ago, but couldn't find anything specific about that in the official documentation.
However, I found this community entry, which contains a lot of helpful information regarding that topic.
What is the meaning of this number?
The number itself describes how many data disks your Virtual Machine will actually have attached in Azure. You can then configure each of these individually.
I personally use 2 data disks of 256 GB each for development, which works well, but less would also be enough.
In Azure, the data disk configuration for that setup shows the two data disks alongside the OS disk and the build disk, which are both fixed.
We need to be able to supply big files to our users. The files can easily grow to 2 or 3 GB. These files are not movies or similar; they are software needed to control and develop robots in an educational setting.
We have some conflict in our project group about how we should approach this challenge. First of all, BitTorrent is not a solution for us (despite the benefits it could bring). The files will be available through HTTP (not FTP) and via a file stream, so we can control who gets access to the files.
As a former pirate in the early days of the internet, I have often struggled with corrupt files and used file hashes and file sets to minimize the amount of re-downloading required. I advocate a small application that downloads and verifies a file set and extracts the big install file once it is completely downloaded and verified.
My colleagues don't think this is necessary and point to the TCP/IP protocol's inherent capabilities for avoiding corrupt downloads. They also mention that Microsoft has moved away from a download manager for their MSDN files.
Are corrupt downloads still a widespread issue, or will the time we spend building a solution to this problem be wasted compared to the number of people who will actually be affected by it?
If a download manager is the way to go, what approach would you suggest we take?
-edit-
Just to clarify: is downloading 3 GB of data in one chunk over HTTP a problem, or should we make our own EXE that downloads the big file in smaller chunks (and verifies them)?
You do not need to build your own download manager. There are some smarter approaches you can take:
1. Split the files into smaller chunks, say 100 MB each. Then, even if a download is corrupted, the user only has to re-download that particular chunk.
2. Most web servers understand and serve Range headers. You can recommend that users use a download manager or browser add-on that takes advantage of this; on Unix/Linux systems, wget is one such utility. (A sketch follows after this list.)
3. It's true that TCP/IP has mechanisms to prevent corruption, but it basically assumes the network stays up and accessible. Point #2 above is one possible workaround for cases where the network goes down completely in the middle of a download.
4. Finally, it is always good to provide a file hash to your users. This not only verifies the download but also helps ensure the integrity of the software you are distributing.
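Here is a minimal Python sketch combining points 2 and 4: resuming a partial download via the Range header and verifying the result against a published SHA-256. The URL and expected hash are placeholders, and it assumes the server honours Range requests:

```python
# Sketch combining points 2 and 4: resume a partial download with an HTTP
# Range request, then verify the file against a published SHA-256 hash.
# The URL and expected hash are placeholders. Error handling (e.g. a 416
# response when the file is already complete) is omitted for brevity.
import hashlib
import os
import urllib.request

URL = "https://example.org/downloads/robot-suite.zip"  # placeholder
EXPECTED_SHA256 = "0000...placeholder...0000"          # published alongside the file
DEST = "robot-suite.zip"

# Resume from wherever the previous attempt stopped.
offset = os.path.getsize(DEST) if os.path.exists(DEST) else 0
request = urllib.request.Request(URL, headers={"Range": f"bytes={offset}-"})

with urllib.request.urlopen(request) as response, open(DEST, "ab") as out:
    while chunk := response.read(1024 * 1024):  # 1 MB at a time
        out.write(chunk)

# Verify the completed file before extracting/installing it.
digest = hashlib.sha256()
with open(DEST, "rb") as f:
    while chunk := f.read(1024 * 1024):
        digest.update(chunk)

if digest.hexdigest() != EXPECTED_SHA256:
    raise ValueError("Checksum mismatch -- re-download the affected chunks")
```

This is essentially what a dedicated download manager does for you, which is why recommending one may be enough instead of shipping your own EXE.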
HTH
In 2004, Mahendra gave a talk about using Plone with DSpace to manage digital assets.
http://linux-bangalore.org/2004/schedules/talkdetails.php?talkcode=C0300032
Mahendra said:
Zope provides a lot of features and an excellent architecture for handling digital content. However, Zope has issues as the stored data scales to the order of Giga/Tera bytes. A combination of Zope + Plone is great as a portal management system, but if an attempt is made to use it for storing digital assets, performance can drop down.
So he proposed using DSpace instead of Plone to manage digital assets. But maybe that has changed today. What are the limits of using Plone as a digital asset manager now?
Since that article was written, the ZODB (Zope's data storage system) has grown support for blobs stored as separate files on the filesystem, which basically means you're limited by the capabilities of the filesystem in use. I know of multiple installations with more than 20GB of data, which was the number mentioned in the article.
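For reference, a minimal sketch of that blob support at the ZODB level (outside of Plone's higher-level content types; the file paths and payload below are placeholders): with a blob directory configured, large binary data is written as separate files on disk rather than into Data.fs.

```python
# Sketch: ZODB blob support stores large binary payloads as separate files
# under blob_dir instead of inside Data.fs. Paths and payload are placeholders.
import transaction
from ZODB import DB
from ZODB.FileStorage import FileStorage
from ZODB.blob import Blob

storage = FileStorage("Data.fs", blob_dir="blobs")  # placeholder paths
db = DB(storage)
connection = db.open()
root = connection.root()

root["asset"] = Blob()
with root["asset"].open("w") as f:
    f.write(b"...large binary payload...")  # placeholder content
transaction.commit()

db.close()
```

So the size of an individual asset is bounded by the filesystem, not by the ZODB itself.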
Now if you want to catalog the assets so they can be easily found, then you'll hit other limits based on the sophistication of your catalog algorithms and data structures. Plone can handle quite a bit of data, but it tends to require lots of RAM and careful tuning (and probably customization) once you get beyond 50,000 content items or so.