I fully understand that this question may be closed, since it is more a matter of opinion than a technical question with an objective answer. However, I want to ask it in case someone can provide a good response. I think it is important to be able to define what you do in a succinct way, so here it goes.
Q: If you are asked, "what's Data Engineering?" what would your definition be? (Not "what does a Data Engineer do?")
This definition came to mind, but does someone have a better one? I am asking in the context of Hadoop/Big Data.
A:
Data engineering is the process of taking Big Data that is stored in
either a structured or unstructured format, processing it in batch or
real-time, and generating data in a new format that can be used for
further consumption, visualization, Machine Learning, or Data Science.
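To make that definition concrete, here is a minimal sketch of such a batch step using Spark's Java API. The paths and field names are invented for the example: raw semi-structured JSON goes in, and a structured Parquet data set ready for further consumption comes out.

    import static org.apache.spark.sql.functions.col;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class BatchPipeline {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("batch-pipeline")
                    .getOrCreate();

            // Ingest semi-structured raw data (hypothetical location).
            Dataset<Row> raw = spark.read().json("hdfs:///data/raw/events/");

            // Process in batch: keep valid records, aggregate per user.
            Dataset<Row> curated = raw
                    .filter(col("userId").isNotNull())
                    .groupBy(col("userId"))
                    .count();

            // Generate data in a new, structured format for further consumption.
            curated.write().parquet("hdfs:///data/curated/events_per_user/");

            spark.stop();
        }
    }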
I would like to share what I think is a definition of Data Engineering related to Big Data:
Data Engineering supports and provides the expertise to design,
construct, and maintain a Big Data system. Data Engineering uses the
tools, techniques, frameworks, and skills that are essential to a good
"Data Infrastructure" or "Data Architecture" behind a Big Data solution.
A good way to define Data Engineering is understanding what a Data Engineer does. Here is a great infographic about it: https://www.datacamp.com/community/blog/data-engineering-vs-data-science-infographic
Some of the responsibilities listed include:
Develop, construct, test, and maintain architectures;
Ensure the architecture will support the requirements of the business;
Develop data set processes for data modeling, mining, and production;
Recommend ways to improve data reliability, efficiency, and quality.
This is more of a hypothetical question, but it might have great consequences. Many of us in the Modelica community are dealing with large-scale systems with expensive simulation times. This is usually not an obstacle for bugfixing and development, but speeding up the simulation might allow for better and faster optimizations.
Recently I came across Modia, which claims to have superb numerical solvers, achieving better simulation times than Dymola, a state-of-the-art Modelica compiler. The syntax seemed to cover all the important bits. Recreating large-scale component models in Modia is unfeasible, but what about automatically translating the flattened Modelica to Modia? Is that realistic? Would it provide a speed-up? Has anyone tried it before? I have searched for existing attempts.
This might also hopefully improve integration of Modelica models and postprocessing / identification tooling within one language, instead of using FMI or invoking a separate executable.
Thanks for any suggestions.
For those interested, we might as well start developing this.
We in the Modia team agree that the modeling know-how in Modelica libraries must be reused, so we are working on a translator from Modelica to Modia (brief details are given in https://ep.liu.se/ecp/157/060/ecp19157060.pdf). The plan is to initially provide translated versions of Modelica.Blocks, Modelica.Electrical.Analog, and Modelica.Mechanics together with Modia.
I have a small project involving some simple financial time-series data with some real-time components on the front end. I was hoping to use Firebase since it offers a lot without my having to set up much infrastructure, but upon investigating, it doesn't seem to be a good choice for storing time-series data.
Admittedly, I have more experience with relational databases so it's possible I am asking an extremely basic question. If I were to use Firestore to store time-series data, could someone provide an example of how one might structure it for efficient querying?
Am I better served using something like Postgres?
Probably your best bet would be to use a time-series database. It looks like Warp 10 has already been mentioned (https://www.warp10.io).
The benefit of something like Warp 10 is the ability to query on the time component of your data. I believe Firebase only has simple greater-than/less-than queries available for time.
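For the Firestore side of the question, here is a rough sketch of the kind of structure and range query I mean, using the google-cloud-firestore Java client with default credentials. The collection and field names are hypothetical: one document per observation in a "ticks" collection, with a "ts" timestamp field that supports exactly those greater-than/less-than filters.

    import com.google.api.core.ApiFuture;
    import com.google.cloud.firestore.Firestore;
    import com.google.cloud.firestore.FirestoreOptions;
    import com.google.cloud.firestore.Query;
    import com.google.cloud.firestore.QueryDocumentSnapshot;
    import com.google.cloud.firestore.QuerySnapshot;
    import java.util.Date;

    public class TimeSeriesQuery {
        public static void main(String[] args) throws Exception {
            Firestore db = FirestoreOptions.getDefaultInstance().getService();

            // One document per observation, each with a "ts" timestamp
            // field plus whatever value fields you need.
            Date end = new Date();
            Date start = new Date(end.getTime() - 60 * 60 * 1000L); // last hour

            // Range filters plus orderBy on the same field is the pattern
            // Firestore handles efficiently.
            Query query = db.collection("ticks")
                    .whereGreaterThanOrEqualTo("ts", start)
                    .whereLessThan("ts", end)
                    .orderBy("ts");

            ApiFuture<QuerySnapshot> future = query.get();
            for (QueryDocumentSnapshot doc : future.get().getDocuments()) {
                System.out.println(doc.getId() + " -> " + doc.getData());
            }
        }
    }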
Recently, I was reading a paper titled "On the Effectiveness of Concern Metrics to Detect Code Smells: An Empirical Study".
I come from a non-English-speaking country, and I cannot quite understand what "Concern Metrics" means in the field of software engineering.
Is it not referring to the relationship between objects?
I have some understanding of Java and C#; perhaps someone could use Java to give me an example.
Thanks.
As the paper's abstract says: "While traditional metrics quantify properties of software modules, concern metrics quantify concern properties, such as scattering and tangling." Are you familiar with the cross-cutting concern concept? This question provides examples of concerns: Cross cutting concern example. Try reading papers on aspect-oriented programming (AOP) to grasp more concepts and better understand the relationship between concerns and code. The metrics are attempts to quantify, for instance, the degree of scattering of a concern (e.g. login) over the source code.
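Since you mentioned Java, here is a small, made-up illustration (not taken from the paper) of a scattered concern. Neither class below is about logging, yet both contain logging code; a concern metric such as "degree of scattering" would count how many modules (here: 2) the logging concern touches.

    import java.util.logging.Logger;

    // The "logging" concern cuts across both classes: it is tangled
    // with their business logic instead of living in one module.

    class AccountService {
        private static final Logger LOG =
                Logger.getLogger(AccountService.class.getName());

        void withdraw(double amount) {
            LOG.info("withdraw called with " + amount); // logging concern
            // ... business logic for withdrawing money ...
        }
    }

    class ReportGenerator {
        private static final Logger LOG =
                Logger.getLogger(ReportGenerator.class.getName());

        void generate() {
            LOG.info("generating report"); // logging concern again
            // ... business logic for building a report ...
        }
    }

AOP would let you pull that repeated logging code out into a single aspect; the concern metrics quantify how badly it is spread around before you do.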
What can you do after learning big-data-related concepts such as Hadoop, ML, NLP, etc.? Where can you implement these?
Not really a software-related question, but it's very relevant to current technology and why some software exists. So here is an opinion.
We now live in a world where it is possible to monitor and digitally record information on an epic scale that continues to expand with concepts like The Internet of Things.
With this information it becomes possible to look at the evidence behind decisions that previously would have been made by gut instinct or opinion. What impact does road design have on traffic flow? Which medical drugs get the best results in the real world, and not just in drug trials? Is there a correlation between office temperature and productivity? And so on, for millions of questions in different domains.
Around the world, organisations are using data they never previously had to get better at whatever they do (good or bad).
The big data concepts are the tools for managing all this information. Big Data is not just large in volume; it is often unstructured and comes in different forms.
So, to answer your question: you can implement these concepts by working with organisations that are using Big Data. Hopefully you can see the potential of Big Data, along with the mind-bending headache it can create when trying to make sense of it.
Both are open-source distributed time-series databases: OpenTSDB is for metrics and is based on HBase, while InfluxDB handles metrics and events and has no external dependencies.
Are there any other comparisons between them?
And if I want to store and query/analyze metrics in real time, with no deterioration or loss of the time-series data, which would be better?
At one of the conferences I heard of people running something like Graphite/OpenTSDB for collecting metrics centrally, and InfluxDB locally on each server to collect metrics only for that server. (InfluxDB was chosen for local storage because it is easy to deploy and light on memory.)
This is not directly related to your question, but the idea appealed to me so much that I wanted to share it.
Warp 10 is another option worth considering (I'm part of the team building it), check it out at http://www.warp10.io/.
It is based on HBase but also has a standalone version which will work fine for volumes in the low hundreds of billions of data points, so it should fit most use cases out there.
Among the strengths of Warp 10 is the WarpScript language which is built from the ground up for manipulating (Geo) Time Series.
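To give a flavour of that, here is a rough sketch (not an official example) of posting a small WarpScript fetch to a Warp 10 instance's /api/v0/exec endpoint from Java. The host, read token, and class name are placeholders; the script fetches the last hour of a hypothetical series.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class Warp10Fetch {
        public static void main(String[] args) throws Exception {
            // WarpScript: [ token class labels end span ] FETCH
            // 'READ_TOKEN' and 'cpu.load' are placeholders.
            String script = "[ 'READ_TOKEN' 'cpu.load' {} NOW 1 h ] FETCH";

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:8080/api/v0/exec"))
                    .POST(HttpRequest.BodyPublishers.ofString(script))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());

            System.out.println(response.body()); // JSON-encoded list of time series
        }
    }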
Yet another open-source option is Blueflood: http://blueflood.io.
Disclaimer: like Paul Dix, I'm biased by the fact that I work on Blueflood.
Based on your short list of requirements, I'd say Blueflood is a good fit. Perhaps if you can specify the size of your dataset, the type of analysis you need to run or any other requirements that you think make your project unique, we could help steer you towards a more precise answer. Without knowing more about what you want to do, it's going to be hard for us to answer more meaningfully.