What does big data have to do with cloud computing? - bigdata

I have tried to explain the relation between big data and cloud computing.

They overlap. Neither depends on the other. Cloud computing enables companies to rent infrastructure over time rather than pay up-front for computers and maintain them over time.
In general, cloud vendors allow you to rent large pools of servers and build networks of servers (clusters).
You can pay for servers with large storage drives and install software like the Hadoop Distributed File System (HDFS), Ceph, GlusterFS, etc. This software presents a single "shared filesystem". The more servers you combine into this filesystem, the more data you can store.
Now, that's just storage. Hopefully, these servers also have a reasonable amount of memory and CPU. Other technologies such as YARN (with Hadoop), Apache Mesos, and Kubernetes/Docker allow you to create resource pools to deploy distributed applications that spread over all those servers and read data that's stored on all those other machines.
The above is mostly block storage, though. The cheaper alternative is object storage such as Amazon S3, which Hadoop can use as a Hadoop-compatible filesystem. There are other object storage solutions, but people use S3 because it's highly available (via replication) and can be secured more easily with access keys and policies.
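To make the "more servers, more data" point concrete, here is a rough back-of-the-envelope sketch. The function name and the node and disk sizes are made up for illustration; the replication factor of 3 is the usual HDFS default. It ignores filesystem overhead, so treat the result as an upper bound rather than a sizing guide.

```python
def usable_capacity_tb(nodes: int, disk_tb_per_node: float, replication: int = 3) -> float:
    """Rough usable capacity of a replicated distributed filesystem.

    Every block is stored `replication` times, so usable space is the
    raw total divided by the replication factor. Filesystem overhead
    and reserved space are ignored (an intentional simplification).
    """
    if nodes <= 0 or replication <= 0:
        raise ValueError("nodes and replication must be positive")
    return nodes * disk_tb_per_node / replication
```

For example, `usable_capacity_tb(30, 4.0)` treats 30 servers with 4 TB each as 120 TB raw, of which roughly a third is usable once every block is kept three times.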

Big data and cloud computing are two of the most widely used technologies in today's Information Technology world. With these two technologies, business, education, healthcare, research and development, and other fields are growing rapidly and gaining new ways to expand.
In cloud computing, we can store and retrieve data from anywhere at any time, whereas big data is a large set of data that is processed to extract the necessary information.
A customer can shift to cloud computing when they need rapid deployment and scaling of applications. If the application deals with highly sensitive data and requires strict compliance, one should keep things on the cloud. Big data, in contrast, is for cases where traditional methods and frameworks are ineffective. Big data is not a replacement for relational database systems; it solves specific problems related to large data sets and is not aimed at small data.
Big data technologies include Hadoop, MapReduce, and HDFS, while cloud computing comes in four deployment models: public, private, hybrid, and community cloud.
Cloud computing provides enterprises with a cost-effective and flexible way to access the vast volume of information we call Big Data. Because of big data and cloud computing, it is now much easier to start an IT company than ever before. When big data and cloud computing were first combined, it opened the road to endless possibilities. Various fields have seen drastic changes that were made possible by this combination. It changed the decision-making process for companies and gave a huge advantage to analysts, who could base their results on concrete data.

Related

What is missing in JuliaDB to use it as production database in a website backend?

I have some difficulty understanding the pros and cons of using JuliaDB as the main backend database for a production website.
https://juliadb.org/
My use case is a collaborative data science platform. The client expects 1 million unique visitors and 100,000 writes per day. Well... I wish.
Using a SQL database means that I need to "translate" the data science dataframes used for the calculations into SQL and back.
On the other hand, JuliaDB is an end-to-end solution.
Regarding the different criteria for a production website database:
Julia natively has concurrency:
Julia supports three main categories of features for concurrent and
parallel programming:
Asynchronous "tasks", or coroutines
Multi-threading
Distributed computing
Julia Tasks allow suspending and resuming computations for I/O, event
handling, producer-consumer processes, and similar patterns. Tasks can
synchronize through operations like wait and fetch, and communicate
via Channels.
Multi-threading functionality builds on tasks by allowing them to run
simultaneously on more than one thread or CPU core, sharing memory.
Finally, distributed computing runs multiple processes with separate
memory spaces, potentially on different machines. This functionality
is provided by the Distributed standard library as well as external
packages like MPI.jl and DistributedArrays.jl.
On the other hand, the JuliaDB docs say that it supports parallel computing but do not give many details.
Can JuliaDB handle parallel connections and asynchronous operations, making it performant for lots of users using it in parallel?
From your question it looks like what you need is a massively parallel data ingestion mechanism. You need a software architecture that allows you to concurrently collect data from a huge number of users.
Perhaps you should look at one of the NoSQL databases that provide horizontal scaling; a good example is MongoDB (or perhaps a cloud equivalent such as DynamoDB).
If your data volume and parallelism are even higher, you should consider a streaming solution such as Apache Kafka.
JuliaDB, on the other hand, sits at the other end of the processing workflow. Once your massive data gets collected, it ends up in an analytic process. In recent years the most popular tool for this has been the Hadoop stack, with Apache Spark used for processing.
JuliaDB brings a new paradigm to the analytics step of the data workflow. With this tool you can massively parallelize the processing of huge data, and hence you should consider JuliaDB as a nice alternative to Spark.
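The split this answer describes, a write path that absorbs many concurrent users and a separate analytics step that runs on what was collected, can be sketched roughly as follows. This is only an illustration: the in-memory `asyncio.Queue` stands in for the ingestion layer (MongoDB, DynamoDB, or Kafka in the answer), `analyze` stands in for the JuliaDB/Spark step, and all function names and numbers are hypothetical.

```python
import asyncio
from statistics import mean


async def user_session(user_id: int, queue: asyncio.Queue, writes: int) -> None:
    """Simulate one user sending a few writes to the ingestion layer."""
    for i in range(writes):
        await queue.put({"user": user_id, "value": user_id * 10 + i})
        await asyncio.sleep(0)  # yield so other users' writes can interleave


async def ingest(users: int, writes_per_user: int) -> list[dict]:
    """Collect writes from many concurrent users into one batch."""
    queue: asyncio.Queue = asyncio.Queue()
    await asyncio.gather(
        *(user_session(u, queue, writes_per_user) for u in range(users))
    )
    batch = []
    while not queue.empty():
        batch.append(queue.get_nowait())
    return batch


def analyze(batch: list[dict]) -> float:
    """Analytics step: runs afterwards, on the collected batch."""
    return mean(row["value"] for row in batch)


if __name__ == "__main__":
    collected = asyncio.run(ingest(users=100, writes_per_user=5))
    print(len(collected), "rows collected; mean value:", analyze(collected))
```

The point of the design is that the part handling user traffic never waits on the analytics code, and the analytics code never has to deal with individual connections.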

Ocean Protocol: Decentralized Big Data Sharing and Artificial Intelligence

Ocean Protocol claims that it created a "decentralized data marketplace" by decentralizing data sharing with blockchain. They say that their platform can be used in Artificial Intelligence. However, the amount of data used in Artificial Intelligence is huge. (Please have a look here, where it introduces this protocol.)
One serious question is where the data gets stored, especially when the data is huge. The Ocean Protocol white paper answers this question as follows:
Ocean itself does not store the data. Instead, it links to data that
is stored, and provides mechanisms for access control. The most
sensitive data (e.g. medical data) should be behind firewalls,
on-premise. Via its services framework, Ocean can bring the compute to
the data, using on-premise compute. Other data may be on a centralized
cloud (e.g. Amazon S3) or decentralized cloud (e.g. Filecoin). In both cases it should be encrypted. This means that
the Ocean blockchain does not store data itself. This also means that
we can remove the data, if it’s not on a decentralized and immutable
substrate.
So, doesn't this mean that the Ocean blockchain cannot provide immutability of shared data?
You are correct. Ocean itself does not store the data, so it has no control over whether the data is stored immutably (or not).

Cassandra : isolated workloads

I have three workloads.
DATACENTER1 sharing data by rest services - streaming ingest
DATACENTER2 load bulk - analysis
DATACENTER3 research
I want to isolate the workloads, so I am going to create one datacenter for each workload.
The objective is to prevent a heavy process from consuming all the resources and to guarantee high data availability.
Has anyone already tried this?
During a bulk load on datacenter2, is data availability good on datacenter1?
The short answer is that a workload in one datacenter won't disrupt the load in the other datacenters. It works as follows:
Conceptually, Cassandra groups nodes into Virtual Data Centers (VDCs), and a Keyspace's replication settings specify which VDCs hold its data. Nodes with similar workloads must be assigned to the same VDC. Segregating workloads ensures that only (exactly) one workload is ever executed in a VDC. As long as you follow this pattern, it works.
Data sync needs to be monitored under load on busy nodes, but that's a normal concern on any Cassandra deployment.
DataStax Enterprise also supports this model, as can be seen from:
https://docs.datastax.com/en/datastax_enterprise/4.6/datastax_enterprise/deploy/deployWkLdSep.html#deployWkLdSep__srchWkLdSegreg
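Workload separation like this is usually expressed through per-datacenter replication settings on the keyspace. The sketch below only builds the CQL text and does not talk to a cluster. The keyspace name and replica counts are hypothetical, the datacenter names are the ones from the question, and it assumes the nodes have already been assigned to those datacenters in the snitch configuration.

```python
def create_keyspace_cql(keyspace: str, replicas_per_dc: dict[str, int]) -> str:
    """Build a CREATE KEYSPACE statement using NetworkTopologyStrategy.

    Each datacenter gets its own replica count, so every workload can
    read and write against replicas in its own datacenter.
    """
    if not replicas_per_dc:
        raise ValueError("at least one datacenter is required")
    options = ", ".join(f"'{dc}': {count}" for dc, count in replicas_per_dc.items())
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} "
        f"WITH replication = {{'class': 'NetworkTopologyStrategy', {options}}};"
    )


if __name__ == "__main__":
    # Hypothetical replica counts for the three workloads in the question.
    print(create_keyspace_cql(
        "shared_data",
        {"DATACENTER1": 3, "DATACENTER2": 2, "DATACENTER3": 1},
    ))
```

Clients of each workload would then typically connect to their own datacenter and use a datacenter-local consistency level such as LOCAL_QUORUM, so that a bulk load on DATACENTER2 does not make requests on DATACENTER1 wait for remote replicas.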

cluster vs Grid vs Cloud

There are two questions:
1) What is the difference between a Cluster and a Grid?
2) What is the Cloud?
I am not looking for conceptual definitions.
I found a lot of those by googling, but the problem is that I still do not get it,
so I believe the answer I seek is different. From what I could research online, I am starting to think that
many article writers who try to explain this either do not understand it deeply enough themselves
or are not able to explain their knowledge to an average guy like myself (which is a common issue with very technical people).
Just to let you know my level: I am a computer programmer (.NET and LAMP), I can do basic admin on both
Linux flavors and Windows, and I have hands-on experience with Hyper-V. I am now researching Xen and XCP
to set up a test cloud based on two computers for learning purposes.
You do not have to read the info below. It is just my current understanding of cluster, grid and cloud,
included to support my two questions because I thought it would help you understand
what kind of mess is in my head right now and what answers I am looking for.
Thank you.
Two computers used for reference in my statements are "A" and "B"
specs for "A": 2-core Intel CPU, 8GB memory, 500GB disk
specs for "B": 2-core Intel CPU, 8GB memory, 500GB disk
Now I would like to look at A and B roles from Cluster, Grid and from Cloud angle.
Common definitions between Cluster and Grid
1) A cluster or Grid is 2 or more computers hooked up together. On the hardware level
they are hooked up through network cards, and on the software level
they use some kind of program implementing a message passing interface
to make it possible to send commands between nodes.
2) A cluster or Grid does NOT combine CPU power or memory between nodes, meaning
that in this simulation a Firefox browser running on A still has only one 2-core CPU,
8GB memory and 500GB disk available.
Differences between Cluster and Grid:
1) A Cluster only provides the failover part: if node A breaks while Firefox is running,
the cluster software will restart the Firefox process on node B.
2) A Grid, however, is able to run software in parallel on multiple nodes at the same time,
provided that the software is coded with MPI in mind. It can also launch any software on any node
on demand (even if it is not written for MPI).
3) A Grid is also able to combine different types of
nodes (Linux Server, Windows XP, Xbox and PlayStation) into one Grid.
Cloud definition:
1) Cloud is not a technical term at all. It is just a short, convenient word to describe
a computer of unlimited resources. It could also be called a Supercomputer, a Beast, an Ocean or a Universe, but someone
said "Cloud" first and here we are.
2) A Cloud can be based on Grids or on Clusters.
3) From a technical point of view, Cloud is software that combines hardware resources into one,
meaning that if I install Cloud software on a Grid or Cluster then it will combine A and B
and I will get one Cloud like this: 4-core CPU, 16GB memory and 1000GB disk.
edited: 2013.04.02
Item 3) was complete nonsense. A cloud will NOT combine resources from many nodes into one huge resource, so in this case there will be no 4-core CPU, 16GB memory and 1000GB cloud.
Grid computing is designed to parcel out large workloads to many participating grid members, through software on each member which is expecting to hear that request for computation or for data, and to reply with its small piece of the overall puzzle. Applications must be written specifically for this approach to problem-solving. It can be heterogeneous because it's not the OS that matters but the software waiting to hear problem-solving requests.
The expectation of a cluster is that it can run the same executable image across any member node (any node can execute that code), which is what drives its requirement for homogeneity. You can write cluster-aware code which distributes workload throughout the cluster, but again you have to write your code to be cluster-aware in order to take advantage of more than the redundancy features of a cluster. As most application vendors do not write cluster-aware code, the simple redundancy feature is all that's commonly used in cluster deployments, but that does not limit the architecture. Clusters can and do share their resources, and can collaborate on tasks simultaneously.
Cloud, as it's commonly defined, is neither of these, precisely, but it doesn't preclude them, either. Cloud computing assumes the ability to deploy an application without advance knowledge of its underlying operating system, or even control of that operating system, coupled with the ability to expand or reduce the processing and memory footprint available to that application without having to destroy and recreate that environment. All of this is done with enough isolation that the application won't know, or be able to know, what other applications might be installed or running on its shared infrastructure, unless that access is approved by both application managers.
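The "parcel out the work, collect each member's small piece" idea from the grid description can be sketched in a few lines. This is a toy illustration only: threads on one machine stand in for grid members, and the function names and the sum-of-squares workload are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split(work: list[int], parts: int) -> list[list[int]]:
    """Parcel a workload into roughly equal pieces, one per member."""
    return [work[i::parts] for i in range(parts)]


def member_solve(piece: list[int]) -> int:
    """What each member computes on its own piece (here: a sum of squares)."""
    return sum(x * x for x in piece)


def scatter_gather(work: list[int], members: int) -> int:
    """Send each piece to a member, then combine the partial answers."""
    with ThreadPoolExecutor(max_workers=members) as pool:
        partials = list(pool.map(member_solve, split(work, members)))
    return sum(partials)
```

Note that the application itself has to be written this way (split, solve, combine), which is the point the answer makes about grid-style and cluster-aware code.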
I would like to answer my question before this is closed as a duplicate because I believe it can be very frustrating to find correct info in regards to clusters, grids and clouds and I think this post can save time for many. If someone wants to challenge it please do so, otherwise I will mark it as the answer in 1 week.
1) There are many differences and there are none. It really depends on the technical context, but
generally you can connect several nodes and call it a Grid or you can call it a Cluster. I would say a Grid is a Cluster with extended capabilities, such as the ability to connect heterogeneous nodes. Both Grid and Cluster will serve equally well as a scale-out platform. From a network engineer's or programmer's perspective, the difference in implementation or coding will be pretty big if the Grid connects heterogeneous nodes.
2) Now the first question was actually a prelude for second one and I believe it is best answered by
Matt Joyce in this post:
https://stackoverflow.com/a/15286488/2230126
I'll take a crack at it. I have been collecting and saving my notes, scripts, and programs since the year 2002 A.D. This is a chop and paste of my statements over the years. Here is a brain friendly memorization list:
The grid is the hardware and hardware specifications.
a. You plug into the router or switch and setup IP addresses and top-level domains over the internet (which is also known as ICANN).
b. This is like OSI level 1, 2, and 3.
The cluster is the kernel (software ring 0 or 1 if its a virtual type thing going on).
a. The kernel is configured (compiled) to run a network stack that can handle sessions, permission, and account authentication.
b. You set up port to port communications usually over TCP/IP (like in the OSI model).
c. You setup iptables, pf, arp, and other OS level applications or shared objects.
d. You can setup ssh, kerberos, ldap, or some other PKI-database and protocol-socket combo.
e. This is like OSI level 4, 5, and 6.
The cloud is user-space applications.
a. The application processes talk to other application-processes within the cluster.
b. You setup process level permissions (via files, cgroups, and/or user-groups).
c. You setup mysql, redis, riak, Message Brokers, hadoop, apache, nginx, cron, java, haskell, erlang, and etcetera.
d. This is like OSI level 7.
The cloud floats over the cluster that grows from the grid. And actually visually think, cloud in the air, cluster in tree, and grid on the ground. Most of us creative types (which make all these technologies) are visual thinkers that can back it up with mathematical data and code. So always see if you can answer the riddle and correlate technological facsimiles to our physical realm here on Earth.
Intro
Grid, Cluster, and Cloud are three different words that mark their specific time in history. Their definitions have intersecting traits, and in modern usage they are interchangeable. You just need to know when to apply the correct or associated word. For example, I was talking to some older M.D.s (medical doctors) and they wanted to know what the cloud was. So I told them that the cloud was a computer cluster that you rent over the internet. And Bingo, they got the idea within 10 seconds.
I will use a little bit of history in chronological prose.
Grid
The term grid is first used to represent one resource that is repeated across terrestrial landscape or space. The term is frequently used during the distribution of telegraphs where repeaters had to be placed on poles every N radii (plural for radius) to amplify the signal. Another example is the electrical grid that Thomas Edison and Nikola Tesla competitively started spreading around the Earth. Computers got really popular and they soon were expanded across The Grid to replace human telegraph (and telephone) operators.
The Grid is now a bunch of computers that can connect and terminate communication channels. The Grid is an infrastructure of computers that function for one goal, which is to run assembly (or binary) code.
Cluster
Foreseeing the power of computers and actually witnessing computers win wars (Turing's machine), DARPA (or ARPA, which is part of the U.S. military) stepped in.
DARPA started commissioning universities and colleges to utilize the Grid for multi-plexing communication methods (that use baud and protocols). Universities and colleges started making protocols to separate the different tasks that they wanted to carry out over the Grid and target the computers. That started the modern internet. In-house testing clusters were established in laboratories to simulate the grid. Clusters are great for orchestration. A job can be sub-divided over all or some of the slaves within a cluster. The military utilized the college and university's findings and applied the SOFTWARE to the Grid. There were some gotchas with clusters:
Must be same (or near same) hardware
Must have same operating system
The rules were strict because all the instruction-sets had to be the same passing over the CPUs. Clusters usually had a master and slave type relationship. A Cluster usually ran one unic (or unix) job at a time. Clusters had job-schedulers. Then clusters got more complex because hardware manufacturers started making parallel chip architectures (on top of the Von Neumann arch).
Clusters became more powerful. The Clusters inherited more complexity and people were doing more creative things. Clusters could now do different jobs, tasks, processes, asynchronous processes, synchronized processes, and many more interesting things. One box (or computer node) could run more jobs. Now the Grid could be used for multiple purposes. The rate of software updates on clusters was faster than on the actual grid. Clusters were deployed locally on campuses. Clusters started superseding the grid because you could directly produce a public facing stack that out-performed the (national) grid.
My Experience
I went to college during the late 1990s and 2000s, and cluster was the word for a physical laboratory of multiple computers working as one virtual computer. Clusters were used for testing. Once your software worked on the cluster, then you could mv (move) it to the production grade Grid. Then I witnessed network worms and computer viruses control zombie computers. These swarms of zombies could be used as one gigantic virtual cluster to run commands. Well, programmers started DIY (do it yourself) protocols and software like BitTorrent and Napster.
So leaping forward into the future, cluster software for testing is starting to be replaced by Solaris zones, FreeBSD jails, Linux containers, QEMU, hypervisors, VMware, VirtualBox, Vagrant, and Docker.
Cloud
Cloud is a marketing term used to umbrella the hardware of different grids and the software of those clusters. Cloud is one big ubiquitous word used to advertise, promote, and profess all that cluster technology for monetary gains. Cloud is also an effort to wrap all those technologies under one singular word. The Cloud allows multi-tenanted processes to share a gigantic grid. The Cloud maximizes efficiency by sub-dividing the electricity, CPU, RAM, disk, and broadband, which get shared and paid for by consumers. A side effect is that those consumer subscriptions and/or pay-rates started producing profit. The Cloud also allows multiple users to install multiple operating systems that run multiple processes, all in software. So now we have acronyms like IaaS, PaaS, and SaaS. The Cloud can replace the start-up cost that was once so darn difficult to fund and bootstrap. The Cloud is a great solution for mock testing your software and building a consumer base for your business.
From another perspective, the Cloud triggers the brain of non-programmers to think a certain way. For example, the human resource department can comprehend and isolate what is presented in-front of them.
So if you have the money, then you can purchase your share of the cloud experience and have easy support along with it. But if you have the skill-set, the time, the quick know-how, and the ability to install your own servers at co-locations, then do that because it is cheaper over the long run.
That is my narrative on the Grid vs Cluster vs Cloud.
I think this link compares Cluster and Grid well.
As far as I know, there are some exceptions in the case of Clusters. YARN (Yahoo!) tries to handle multi-tenancy and distributed scheduling. Corona (Facebook) also has distributed scheduling.

Distributed C++ game server which uses a database

My C++ turn-based game server (which uses a database) cannot handle the current average number of clients (players), so I want to expand it to multiple computers and databases while all clients still remain within a single game world (the servers will have to communicate with each other and use multiple databases).
Are there any tutorials, books, or common standards that explain how best to do this?
The way you put the database into the picture might be misleading: clustering solutions exist for all of the most widely used RDBMSs, so if you need to support your DB activities with more than one DB node you will just have to check the documentation from your DB vendor.
More complex scenarios arise when it comes to synchronizing the non-DB application state that needs to be shared among several servers. There are already a number of questions here that tackle the same problem, like here or here.
You might also be interested in a messaging system; I have heard good things about ZeroMQ.
Hope this helps.
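If you end up running several databases yourself rather than relying on your vendor's clustering, every game server has to agree on which database holds which player. One simple way to get that agreement, shown here only as a sketch, is a stable hash of the player ID. The function name is hypothetical, Python is used just for brevity (the same idea carries over to C++), and it assumes a fixed number of shards.

```python
import hashlib


def shard_for(player_id: str, shard_count: int) -> int:
    """Map a player to a shard with a stable hash.

    Uses SHA-256 rather than the built-in hash() so that every server
    process computes the same shard for the same player.
    """
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    digest = hashlib.sha256(player_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count
```

The obvious limitation is that changing `shard_count` remaps most players, so this fits best when the number of databases rarely changes.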
