Google Dataflow compared to the MS SSIS ETL tool

Hi all GCP developers,
I am new to GCP data engineering products, but I have experience with the Microsoft SSIS ETL tool. I would like to know what transformations and functionality are available in Google Dataflow. SSIS provides an easy drag-and-drop interface and lets you use SQL to perform ETL.
Dataflow pipelines are mostly written in Python, but how do you transform or load only certain rows of a CSV/text file when a particular field's value is less than a required amount (i.e. filtering rows based on one field)?
The Dataflow name is everywhere online, but why is no documentation with data processing examples available?
If you know of any online course (other than Coursera) or book with practical, hands-on material, please share it.
Thank you

Dataflow is a managed Apache Beam service, so the general Beam quickstarts apply; just set the runner to DataflowRunner.
Here you can find a summary of the available transforms (including the Filter you'd use for the row-filtering example you mentioned) for the Java SDK. Beam is also available for Python and for Go, but the Java API is the most mature of the three.
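For the row-filtering case from the question, a minimal sketch using the Beam Python SDK might look like this (the file name, column layout, and threshold are assumptions for illustration, not part of any official example):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

THRESHOLD = 100.0  # assumed cutoff: keep only rows whose amount is below it

def parse_line(line):
    # Assumed CSV layout: id,name,amount
    record_id, name, amount = line.split(',')
    return {'id': record_id, 'name': name, 'amount': float(amount)}

with beam.Pipeline(options=PipelineOptions()) as pipeline:
    (pipeline
     | 'ReadCSV' >> beam.io.ReadFromText('input.csv', skip_header_lines=1)
     | 'Parse' >> beam.Map(parse_line)
     | 'KeepSmallRows' >> beam.Filter(lambda row: row['amount'] < THRESHOLD)
     | 'Format' >> beam.Map(lambda row: '{id},{name},{amount}'.format(**row))
     | 'Write' >> beam.io.WriteToText('filtered_rows'))
```

By default this runs locally with the DirectRunner; to run it on Dataflow, pass --runner=DataflowRunner together with your project, region, and temp/staging location options.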
Also, if you want a graphical interface more similar to Microsoft SSIS, you may want to look into Dataprep, which is built on top of Dataflow and does provide some more interactive features.

Related

Is it possible to stream IGenericEvents rather than buffering?

Using the Trace Processing library, is it possible to use streaming as described here to parse IGenericEvents from a .etl file?
(I am a developer at Microsoft who works on the TraceProcessor project.)
That is not possible with the current implementation. It is meant to be transparent/irrelevant to users but with non-streaming data sources in TraceProcessor, some of the data is made available by parsing the ETW events and state models directly in TraceProcessor, and the other data is made available in the form of a .NET/managed TraceProcessor projection on top of native/C++ ETW processing done by Xperf. Likewise, the current implementation of the Windows Performance Analyzer (WPA) uses TraceProcessor as the data source for some of its tables, and Xperf as the source for the others.
In order to support streaming access in the current implementation of TraceProcessor, a data source has to be written both A) entirely within TraceProcessor (i.e. not in Xperf), and B) specifically to support streaming. We have typically only added this support when adding new data that wasn't already available in Xperf, or when we had other reasons to do a major rewrite of a data source.
Generic event support in TraceProcessor is currently built on top of the Xperf support, in part because there is some complicated logic required to parse the schemas for the event fields in one pass of the trace, and then populate the IGenericEvents in the next pass.
We don't currently have plans to invest in a streaming version of generic events, but if you are particularly interested, you could create an issue in our issues repo on GitHub and we could keep you posted if plans change.

How to develop data visualization for time series database?

I have a time series database for which I have to build a front-end. The front-end needs to contain some panels and graphs based on the data provided to it from the database. Can anyone suggest a tool that I can use to build the front-end and then connect to the database (InfluxDB)?
You have two popular options as follows:
Chronograf: the 'C' in the TICK stack. It offers a complete dashboarding solution for quickly visualising the data you have stored in InfluxDB. You can run queries and build alerts easily. It is very simple to use and includes several templates. However, it does not have plugin support to extend its visualisation capabilities; for example, there is no pie chart or traffic-light visualisation out of the box.
Grafana: the other popular option. It provides dashboarding capabilities similar to Chronograf, but in contrast you can use it to visualise data stored not only in InfluxDB but also in other data stores such as Prometheus, Elasticsearch, Graphite, etc. It also allows you to install community-developed plugins to extend the solution's visualisation capabilities.
In summary, both are very good and can be easily integrated with InfluxDB. I find Grafana more flexible and feature-rich, and it is more widely used, so you may find answers to specific questions more easily. I also recommend that you have a look at the open issues in both projects' GitHub repositories before coming to a conclusion.
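Whichever you choose, under the hood the dashboard just runs InfluxQL queries against InfluxDB over HTTP. If you ever need a custom panel outside of these tools, a minimal sketch with the influxdb Python client could look like this (the database name, measurement, and field are assumptions for illustration):

```python
from influxdb import InfluxDBClient  # pip install influxdb

# Assumed connection details and schema, purely for illustration
client = InfluxDBClient(host='localhost', port=8086, database='metrics')

# InfluxQL query over a hypothetical 'cpu' measurement with a 'usage' field
result = client.query(
    'SELECT mean("usage") FROM "cpu" WHERE time > now() - 1h GROUP BY time(5m)'
)

for point in result.get_points():
    print(point['time'], point['mean'])
```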

Can I link to Business Objects universe in R?

Is it possible to link to a Business Objects Universe in R?
I can connect R to a SQL database, but that doesn't give me the universe where all the joins are already created. I have end users that currently get their data through Business Objects, but R would better suit their needs. And since they are already familiar with the field names and how the tables are linked in Business Objects, I would like them to just link R to the existing BObj universe.
Thanks!
Since I'm not allowed to recommend specific software (new to me, but fair is fair :)) I'll just mention two options:
1) Search the internet for software that gives ODBC access to Business Objects universes. These are good keywords on my favorite search engine: 'odbc leverages SAP BusinessObjects Security'.
This has the distinct advantage of making sure that all user access is still governed by Business Objects, but the data will have to travel through that server, and you will need to pay for licenses to SAP as well as to the vendor of the new ODBC driver...
2) Give the users in question ODBC access to the same tables that the BO reports/universe run against, and make sure they are allowed to see the SQL in their data providers. Then tell them to 'cook up a report, and then copy & paste the SQL into R'.
Option 2 may save you some licenses and give you better performance, but I would personally go for option 1 :)

Need solution for report generation mechanism

I have an issue. We have a Spring MVC web-based application for which we want a report generation mechanism. I came across Talend ETL. Can anyone tell me whether using Talend as a report generation mechanism would be fruitful? Can I integrate it with my application, or should I search for some JAR that can help with fast report generation?
Thanks
The question is vague, but let me try to answer anyway. Talend is not a reporting platform, but an ETL (read: data handling and transformation) tool.
You can embed a TOS job in your application, and it's advisable to do so if you need to handle data in a moderately complex way without reinventing the wheel (i.e. read an Excel file, do some things, save to a DB...). But using it as a reporting or data visualization platform would be a pain in the neck.
There are better embeddable solutions for these duties. BIRT and JasperReports come to mind, but there are plenty of them. The real question when choosing is: do you need a low-level reporting service, not much more than a framework at the end of the day, or a polished, maybe client-server, solution to query as a service?

Getting data out of PeopleSoft

We have a PeopleSoft installation and I am building a separate web application that needs to pull data from the PeopleSoft database. The web application will be on a different server than PeopleSoft, but the same internal network.
What are my options?
This one's an oldie but it may still be of interest.
PeopleSoft has its own schema within the host database (Oracle, SQL Server, DB2, etc.), consisting of the PSxxx tables, e.g. PSRECDEFN is the equivalent of Oracle's DBA_TABLES. These tables should not be touched by any external code. The application tables are stored in PS_xxx tables, e.g. PS_JOB. These tables can be read and updated by any SQL code.
Many batch programs in PeopleSoft (e.g. Application Engines, COBOL or SQRs) access the tables directly, and this is the fastest way to get data into or out of the database. However, PeopleSoft has quite a rich application layer which is bypassed when doing direct SQL, and this application layer must be replicated in direct SQL code, especially for inserts or updates: there may be updates to other tables, calculations, or increments of database-stored counters.
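For a simple read-only extract from the application tables, a rough sketch in Python over ODBC could look like the following (the DSN, credentials, and column names are assumptions; check your own schema before relying on them):

```python
import pyodbc  # any DB-API driver for your host database (Oracle, SQL Server, DB2) works similarly

# Assumed ODBC data source and credentials, for illustration only
conn = pyodbc.connect('DSN=PSOFTDB;UID=report_user;PWD=secret')
cursor = conn.cursor()

# Read-only query against an application (PS_) table; the column names are assumptions
cursor.execute(
    'SELECT EMPLID, DEPTID, EFFDT FROM PS_JOB WHERE DEPTID = ?',
    ('42000',)
)
for emplid, deptid, effdt in cursor.fetchall():
    print(emplid, deptid, effdt)

conn.close()
```

As noted above, it is safest to keep this kind of access to reads; inserts and updates should go through the application layer (or a Component Interface) so that related tables and counters stay consistent.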
To determine what the application layer does, one must look through the PeopleCode (a VB6-like interpreted language) and the page design (via Application Designer), and use the PeopleCode and SQL trace tools. These days the application layer is huge, so this can be a lengthy task for non-trivial pages. PeopleSoft groups related pages into "Components", and all pages in the component are saved at the same time.
Component Interfaces were introduced with PeopleTools 8 as a means of avoiding all of this. Using a generator within the PeopleSoft Application Designer, a Component Interface is generated based on the component. For many components these can be used to access the pages as a user would, and they can be called from PeopleCode programs, and therefore from App Engine programs and via the Integration Broker. They can also be wrapped in Java code and accessed directly by code able to execute against the app server, or exposed via a web service wrapper. This method is best for low-volume transactions: heavy extracts work better with native SQL.
The online development and tracing tools in PeopleSoft are pretty good, and the documentation is excellent (although quite extensive) and available on: http://download.oracle.com/docs/cd/E17566_01/epm91pbr0/eng/psbooks/psft_homepage.htm
If you are just looking at bringing out data from a given Component, the easiest way would be to turn on the SQL trace (under the Utilities menu in PeopleSoft) and bring up some records for the Component. Wading through the trace file will give you a good idea of what to do, and much of the SQL could be cut and pasted. Another method would be to find an existing report that is similar to what you are trying to do and cut out the SQL.
Having a PeopleSoft business analyst on hand to help you develop the requirements wouldn't hurt either.
Yes - Integration Broker is PeopleSoft's proprietary implementation of a publish/subscribe mechanism, speaking XML. You could, of course, just write code that goes against your database using JDBC or OLE/ODBC; nothing keeps you from doing this. However, you must then understand the PeopleSoft database schema so that you are pulling, inserting, updating, or deleting all of the proper data - normally the PeopleSoft application layer takes care of this for you.
Also, check out Component Interfaces; they are exposed as an API to Java and C/C++.
I guess it depends on your requirement, and which version of PeopleSoft you're on.
Do you want real-time lookup? If that's the case then you'll want to look at Web Services/Integration Broker.
If you want a batch/bulk export then a scheduled App Engine would do the trick.
The best way is to use Integration Broker (IB) services to expose the PeopleSoft database data to external applications. The external application will be able to access the PeopleSoft IB services as XML over HTTP, thus allowing you to use any widely used XML parsers for this purpose.
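As a rough sketch of what the consuming side could look like in Python (the gateway URL, operation name, and XML element names below are assumptions; your Integration Broker configuration defines the real endpoint and message shape):

```python
import requests
import xml.etree.ElementTree as ET

# Hypothetical Integration Broker endpoint exposed over HTTP
SERVICE_URL = 'https://psoft.example.com/PSIGW/HttpListeningConnector'

response = requests.get(SERVICE_URL, params={'Operation': 'EMPLOYEE_GET'}, timeout=30)
response.raise_for_status()

# Any standard XML parser will do; the element names here are made up for illustration
root = ET.fromstring(response.content)
for employee in root.iter('EMPLOYEE'):
    print(employee.findtext('EMPLID'), employee.findtext('NAME'))
```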
The problem with Component Interfaces as opposed to Integration Broker is that Component Interfaces tend to be much slower than direct DB access from within IB service PeopleCode. Also, future additions to the component attached to the Component Interface sometimes tend to 'break' the interface.
For more details on PeopleSoft Integration broker, you can access the online documentation at http://docs.oracle.com/cd/E26239_01/pt851h3/eng/psbooks/tibr/book.htm
Going directly to the database means you have to re-create the presentation logic... see my longer answer above. You can do this for simple pages but otherwise using a component interface is the way to go.
You can also write an SQR process for bulk data extraction. SQR will create an output file which the other application can pick up. SQR would be faster than Application Engine programs as it performs most of the operations in memory.
