Big data use cases [closed]

We are trying to create a dashboard using big data. The data is currently transacted in SQL Server and the front end is in MVC. Because the data volume is far too high to analyse in SQL Server itself, we decided to move to a big-data stack. I chose Cloudera's CDH (managed with Cloudera Manager), Sqoop to import the data from SQL Server into Hive, and Impala to run the analytics. We plan to surface the results through MicroStrategy to provide charts to clients on mobile.
Any ideas or suggestions to improve this process are welcome.

Looks like you're off to a great start. Remember that your analytics can be done with a multitude of tools, not just Impala.
Once you're in Hadoop, Hive and Pig give you a lot of power (even more with UDFs) with an easy learning curve.
If you eventually want to tackle iterative use cases (and exploit machine learning), check out Spark; those workloads are squarely in its wheelhouse, and it is not constrained to MapReduce.
Tons of great tools are available. Enjoy the journey.
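To make the Spark suggestion concrete, here is a minimal PySpark sketch of an iterative machine-learning job (k-means clustering) over a Hive table; the table and column names are hypothetical, not taken from the question.

```python
# Minimal sketch of an iterative ML job in Spark (k-means clustering).
# The Hive table "transactions" and its numeric columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = (SparkSession.builder
         .appName("iterative-analytics-sketch")
         .enableHiveSupport()               # read the Hive tables Sqoop populated
         .getOrCreate())

df = spark.table("transactions")

features = VectorAssembler(
    inputCols=["amount", "quantity"],       # illustrative numeric columns
    outputCol="features").transform(df)

model = KMeans(k=5, seed=42, featuresCol="features").fit(features)
model.transform(features).groupBy("prediction").count().show()
```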

I would consider using two stages: data analysis and data visualisation.
Using two stages makes the solution more flexible and decouples the responsibilities.
Data Analysis
Ingest the data (including cleaning). Sqoop can handle the ingest step; you may need extra steps to clean the data.
Explore/analyse the data. Apache Spark is a very flexible and powerful tool for this.
Store the analysis results in a well-defined format (see the sketch after this list).
Data Visualisation
Load the data produced by the analysis phase.
Visualise it, using Highcharts, Kibana or Dashing, or build a customised dashboard with D3.
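A minimal sketch of what the analysis stage could look like in PySpark, writing its output in a format the visualisation layer can load; the table, column and path names are hypothetical.

```python
# Sketch of the analysis stage: read the ingested Hive table, aggregate,
# and store the result where the dashboard layer can load it.
# Table, column and output-path names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("dashboard-analysis-sketch")
         .enableHiveSupport()
         .getOrCreate())

daily_totals = (spark.table("transactions")            # populated by Sqoop
                .groupBy(F.to_date("created_at").alias("day"))
                .agg(F.sum("amount").alias("total_amount"),
                     F.count("*").alias("tx_count")))

# JSON (or CSV) is easy for Highcharts/D3 to consume
daily_totals.coalesce(1).write.mode("overwrite").json("/data/dashboard/daily_totals")
```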

Related

What are the limitations of Shiny apps compared to another web programming language like Ruby on Rails? [closed]

I am using Shiny to create interactive graphs on a website, but it doesn't seem to have support for things like comment threads or database storage. Are you supposed to somehow use Shiny within another language?
This question was downvoted, and I hope I won't lose scarce rep points by answering it. I can't speak for the Shiny development team, and I'm only a novice Shinyapps developer, but ...
It seems to me that Shiny aims to make it easy for R programmers to build small to medium-sized, self-contained, web-based, graphics-centric, interactive data-analysis displays, without adding an unreasonable amount of code to what they wrote to do their actual work, i.e. the analysis. This is a fairly common requirement for researchers and practitioners (as opposed to full-time professional developers) coming from the R heritage and culture (stats and data science). Shiny achieves this aim pretty well!
You can find out more about the kinds of problems that Shiny aims to solve by going to the source. Note that it says "Turn your analyses into interactive web applications", not "Build a full-service website with interactive chat and a backing store". It sounds as if you want something different in scale and kind, and you may be wasting time by trying to shoehorn your requirement into the Shiny problem/solution space. I've occasionally hammered nails into wood using a pair of pliers because my toolbox was at the bottom of the ladder, but that didn't make it the right thing to do!

Need advice to choose graph database [closed]

I'm looking at options for a graph database to be used in a project. I expect ~100,000 writes (vertex + edge) per day and far fewer reads (several per hour). The most frequent query is a traversal two edges deep that I expect to return ~10-20 result nodes.
I don't have experience with graph databases and want to work with Gremlin so that I can switch to another graph database if needed. At the moment I'm considering two options: Neo4j and Titan.
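For concreteness, that most frequent query would look roughly like this through the gremlinpython client (a sketch only; the endpoint, starting-vertex lookup and property names are illustrative):

```python
# Sketch of the two-edge-deep traversal described above, via gremlinpython.
# The server URL, vertex label/property and result limit are illustrative.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

results = (g.V().has("user", "name", "alice")   # starting vertex
            .out().out()                        # two edges deep
            .dedup()
            .limit(20)                          # expect ~10-20 result nodes
            .valueMap(True)
            .toList())

print(results)
conn.close()
```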
As far as I can see, there is a large community and plenty of information and tooling for Neo4j, so I'd prefer to start with it. Its capacity numbers (~34 billion nodes, ~34 billion edges) should be enough for our needs. But I'm not sure what hardware requirements I will face in that case, and I haven't seen any parallelisation options for its queries.
On the other hand, Titan is built for horizontal scalability and integrates with massively parallel tools like Spark, so I can expect hardware requirements to scale roughly linearly. But there is much less information, community and tooling around Titan.
I'll be glad to hear your suggestions.
Sebastian Good gave a wonderful presentation comparing several databases to each other; you can have a look at his results here.
A quick summary of the presentation is here.
For benchmarks of each graph database with different datasets, node sizes and caches, have a look at this GitHub repository by socialsensor. Just so you know, the results in the repo differ a bit from the ones in the presentation.
My personal recommendation is:
If you have deep pockets, go for Neo4j. With the technical support and easy Cypher, things will go pretty quickly.
If you support open source (and are patient with its development cycles), go for Titan DB with the Amazon DynamoDB backend. This will give you "infinite" scalability and good performance with both EC2 machines and DynamoDB tables. Check here for docs and here for their code for more information.
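For comparison with the Gremlin query in the question, the same two-hop traversal in Cypher via Neo4j's official Python driver might look like this (a sketch; the URI, credentials, label and property names are made up):

```python
# Sketch: the same two-edge-deep query expressed in Cypher,
# run through the official neo4j Python driver.
# URI, credentials, label and property names are made up.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

query = (
    "MATCH (u:User {name: $name})-[*2]->(n) "   # exactly two hops out
    "RETURN DISTINCT n LIMIT 20"
)

with driver.session() as session:
    records = session.run(query, name="alice").data()

print(records)
driver.close()
```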

How to make a choice between OpenTSDB and InfluxDB or other TSDS? [closed]

Both are open-source distributed time-series databases: OpenTSDB is for metrics, while InfluxDB handles metrics and events and has no external dependencies; OpenTSDB, on the other hand, is built on HBase.
Are there any other comparisons between them?
And if I want to store and query/analyse time-series metrics in real time, without any loss of fidelity over time, which would be the better choice?
At one conference I heard of people running something like Graphite/OpenTSDB to collect metrics centrally, and InfluxDB locally on each server to collect metrics for that server only. (InfluxDB was chosen for local storage because it is easy to deploy and light on memory.)
This is not directly related to your question, but the idea appealed to me a lot, so I wanted to share it.
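As a rough illustration of the "local InfluxDB per server" idea, reporting a metric to a local instance over InfluxDB's HTTP line-protocol endpoint could look like this (a sketch; the database name, measurement and field are made up):

```python
# Sketch: report one metric to a local InfluxDB instance over its HTTP API
# using the line protocol. Database name, measurement and field are made up.
import socket
import time

import requests

point = "cpu_load,host={host} value=0.64 {ts}".format(
    host=socket.gethostname(),
    ts=int(time.time() * 1e9))          # InfluxDB expects nanosecond timestamps

resp = requests.post(
    "http://localhost:8086/write",
    params={"db": "local_metrics"},     # database must already exist
    data=point)
resp.raise_for_status()                 # a 204 response means the write succeeded
```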
Warp 10 is another option worth considering (I'm part of the team building it); check it out at http://www.warp10.io/.
It is based on HBase but also has a standalone version which works fine for volumes in the low hundreds of billions of datapoints, so it should fit most use cases out there.
Among the strengths of Warp 10 is the WarpScript language, which is built from the ground up for manipulating (geo) time series.
Yet another open-source option is Blueflood: http://blueflood.io.
Disclaimer: like Paul Dix, I'm biased by the fact that I work on Blueflood.
Based on your short list of requirements, I'd say Blueflood is a good fit. Perhaps if you can specify the size of your dataset, the type of analysis you need to run or any other requirements that you think make your project unique, we could help steer you towards a more precise answer. Without knowing more about what you want to do, it's going to be hard for us to answer more meaningfully.

Advice on structuring a social science experiment in Meteor [closed]

I am playing around with Meteor as a framework for building web-deployed economics experiments. I am seeking advice on how to structure the app.
Here is the logic:
Users create an account and are put in a queue.
Groups of size N are formed from users in the queue.
Each group proceeds together through a series of stages.
Each stage consists of:
a. The display of information to, and the collection of simple input from, group members.
b. Computing new information based on the inputs of all group members.
Groups move from stage to stage together, because the group data from one stage provides information needed in the next stage.
When the groups complete the last stage, data relating to their session is saved for analysis.
If a user loses their connection, they can rejoin by logging in and resume the state they were in before.
It is much like a multiplayer game, and I know there are many examples of those, so perhaps the answer to my question is just a pointer to a similarly structured and well-developed Meteor game.
I'm writing a framework for deploying multi-user experiments on Meteor. It basically does everything you asked for and a lot more.
https://github.com/HarvardEconCS/turkserver-meteor
We should talk! I'm currently using this for a couple of experiments but it is not well documented yet. However, it is a redesign of https://github.com/HarvardEconCS/TurkServer so all of the important features are there. Also, I'm looking to broaden out from using Amazon Mechanical Turk for subject recruitment, and Meteor makes that easy.
Even if you're not interested in using this, you should check out the source for how people are split into groups and kept segregated. The original idea came from here, but I've expanded on it significantly so that it is automagically managed by the smart package instead of having to do it yourself.

Modularization of PL/SQL packages [closed]

I am currently working on a restructuring project focused mainly on the Oracle PL/SQL packages in our company. It involves many of the company's core packages. We have never had documentation for the back-end work done so far, and the intention of this project is to create a new set of APIs, based on the current logic, in a structured way, while dropping all the unwanted logic that currently exists in the system.
We are also building a new module for the main business of the organization that will work on top of these newly created back-end APIs.
As I started this project, I found that most of the wrapper APIs contained more than 8,000 lines of code. I managed to convert this code into many single-purpose APIs and invoked them from the wrapper API.
This has been a time-consuming activity in itself, but by calling independent APIs for each piece of business functionality I was able to cut the wrapper API down to just 900 lines of code.
I would like to know from you experts whether this way of modularizing the code is good and worth the time invested, as I am not sure it has many performance benefits.
From a code readability perspective it is definitely helping: after the restructuring I understand the 8,000 lines of code much better, and I am sure the other developers in my organization will too.
Please let me know whether I am doing the right thing, and if it has advantages apart from readability, please mention them. Sorry for the long explanation.
Also, is it okay to have more than 1,000 lines of code in a wrapper API?
Modularizing the code this way is worth the effort; apart from readability, it makes the code:
Easy to debug
Easy to update
Easy to modify/maintain
Less change-prone, due to low coupling
More reusable, if the modules are made generic
Easy to scan for unused code
