Is there a way to create a single copy data pipeline that shares a single source data set and file system connection pointing to different drives? - azure-resource-manager

I'm attempting to deploy an Azure Data Factory with a copy data pipeline that pulls files from one or more deployed / on-prem file system paths and dumps them in blob storage. The source file paths on the file system may span multiple drives (e.g., C:\fileshare1 vs. D:\fileshare2) and may include network locations referenced via UNC paths (e.g., \\localnetworkresource\fileshare3).
I'd like to configure a single local file system connection and source dataset and just parameterize the linked service's host property. My pipeline would then iterate over a collection of file share paths, reusing the dataset and linked service connection. However, it doesn't look like there's any way for the dataset or pipeline to provide the host information to the linked service. It's certainly possible to provide folder information from the pipeline and dataset, but that gets concatenated onto the host specified in the linked service connection, so it won't give me access to different drives or network resources.
It was reasonably straightforward to do this by configuring separate linked service connections, data sets and pipelines for each distinct file share that needed to be included, but I'd prefer to manage a single pipeline.

Yes, you can parameterize the linked service.
https://learn.microsoft.com/en-us/azure/data-factory/parameterize-linked-services
Currently, the ADF UI only exposes parameterization for eight kinds of linked services, but every linked service type is supported by the ADF runtime; you can add the parameters directly in the JSON definition.
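For example, a parameterized file system linked service could look something like the sketch below. The parameter name hostPath, the integration runtime reference, and the credential placeholders are assumptions; adapt them to your environment:
{
    "name": "FileSystemLinkedService",
    "properties": {
        "type": "FileServer",
        "parameters": {
            "hostPath": { "type": "String" }
        },
        "typeProperties": {
            "host": "@{linkedService().hostPath}",
            "userId": "<domain>\\<serviceAccount>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            }
        },
        "connectVia": {
            "referenceName": "<SelfHostedIntegrationRuntime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}
The dataset that uses this linked service declares a matching parameter and passes it through in its linkedServiceName reference, so a ForEach activity in the pipeline can supply a different host (C:\fileshare1, D:\fileshare2, or a UNC path) on each iteration.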
Refer to these two posts:
Azure Data Factory - Dynamic Account information - Parameterization of Connection
How to provide connection string dynamically for azure table storage/blob storage in Azure data factory Linked service

Related

"XXXX_Datacatalog" and "XXXX_Datacatlog_viewer" of file's connection source in WKC

I noticed that files in WKC can show two different connection sources in their properties. One is the connection name of COS, like AAAAA_Datacatalog, and the other is almost the same but with "_viewer" appended, like "AAAAA_Datacatlog_viewer". What condition causes this difference? Do these two kinds of files behave differently in WKC?
A file stored in IBM Cloud Object Storage (COS) reflects which connection it came from. By default, we create two system COS connections for you as part of the catalog: the first has editor rights and the other has viewer rights. Depending on how you add assets later on and which connection you use, the asset will appear under the path of that connection object.

How to encrypt User-Provided Service values in cloud foundry?

I am trying to encrypt my username and password on Cloud Foundry. Currently I am storing these values in a user-provided service (CUPS), which exposes them through VCAP_SERVICES.
SPRING_DATASOURCE_URL: jdbc:oracle:thin:#//spring.guru.csi0i9rgj9ws.us-east-1.rds.amazonaws.com:1521/ORC
SPRING_DATASOURCE_USERNAME: UserAdmin
SPRING_DATASOURCE_PASSWORD: p4ssw0rd
SPRING_DATASOURCE_initialize: false
I want to encrypt these values so that they show some type of token or UUID instead of my actual username and password. How can I encrypt them so that, when I look at my VCAP_SERVICES, they are not exposed?
Example from Cloud Foundry Provided Service
VCAP_SERVICES=
{
cleardb: [
{
name: "cleardb-1",
label: "cleardb",
plan: "spark",
credentials: {
SPRING_DATASOURCE_URL: "jdbc:oracle:thin:#//spring.guru.csi0i9rgj9ws.us-east-1.rds.a‌​mazonaws.com:1521/OR‌​C",
SPRING_DATASOURCE_USERNAME: "UserAdmin",
SPRING_DATASOURCE_PASSWORD: "p4ssw0rd",
SPRING_DATASOURCE_initialize: "false"
}
}
]
}
As you can see, the VCAP_SERVICES above exposes the credentials in plain text. How can I encrypt it so that the username and password appear encrypted, like the example below?
Desired output
Username: hVB5j5GgdiP78xCSV9sNv4FeqQJducBxXlB81090ozYB
Password: hVB523fff78xCSV9sNv4FeqQ341090324234fdfdsrrf
Depending on what you want to achieve, you can use Spring Vault (which you mentioned) or an external Vault instance with the HashiCorp service broker (https://github.com/hashicorp/cf-vault-service-broker) to store and retrieve credentials securely from within your application.
As a side note, the MongoDB service credentials in the screenshot are not encrypted but randomly generated by the service broker.
Most importantly, you shouldn't store or provide service credentials in your application manifest; instead, obtain those credentials (for bound Cloud Foundry services) by parsing the VCAP_SERVICES environment variable.
https://docs.cloudfoundry.org/devguide/deploy-apps/environment-variable.html#VCAP-SERVICES
External services should be presented to Cloud Foundry apps via user-provided services (CUPS): https://docs.cloudfoundry.org/devguide/services/user-provided.html
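As a sketch of what that looks like on the application side: Spring Boot flattens VCAP_SERVICES into vcap.services.* properties, so, assuming a bound user-provided service named (hypothetically) oracle-creds with url/username/password keys, the datasource could be wired like this:
# application.properties (service name and credential keys are assumptions)
spring.datasource.url=${vcap.services.oracle-creds.credentials.url}
spring.datasource.username=${vcap.services.oracle-creds.credentials.username}
spring.datasource.password=${vcap.services.oracle-creds.credentials.password}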
Since you appear to be using Spring already, you might want to look at Spring Cloud Config.
https://cloud.spring.io/spring-cloud-config/
For larger projects it makes it easy to externalize and manage your configuration. A common setup is to store the configuration in Git (though there are other backends, including Vault); Spring Cloud Config then runs as a server and provides the configuration to your running applications (clients). Your application (client) doesn't need to do much beyond including the Spring Cloud Config dependencies and a couple of lines of config. Settings are obtained automatically and integrated through the Environment and PropertySource abstractions, which makes for a very clean integration where you don't need to do a lot of work.
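As a rough idea of what the client side involves (the application name and config server URI below are placeholders, not anything prescribed), the app adds the spring-cloud-starter-config dependency and a bootstrap.yml along these lines:
spring:
  application:
    name: my-app
  cloud:
    config:
      uri: http://my-config-server.example.com:8888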
For smaller projects, it might be more overhead than you'd like. For starters, you have to run the Spring Cloud Config server. If you only have one or two small apps, the resources needed to run the SCC server might exceed what your apps use in total. Complexity is another concern: using SCC definitely adds some complexity and possible failure scenarios, and you'd need to understand what's happening well enough to troubleshoot when there is a problem.
What you might want to consider instead for smaller projects is simply using user provided services in CF. These make for a central place to store your config settings (doesn't have to just be databases, could be keys and other things too). You can then bind these to your apps to expose the configuration settings to that app.
There is some security in that Cloud Controller manages who can access your services. I believe the information that Cloud Controller stores is also encrypted at rest, which is a plus. That said, information is exposed to your applications via environment variables (i.e. VCAP_SERVICES) so anything in the container that can see the environment variables will be able to read your settings.
Using user-provided services is one step up from environment variables, not so much from a security standpoint as from a management standpoint. You create the user-provided service once and can then bind it to any number of apps. With environment variables, you'd need to set them for every app, which is more tedious and prone to typos. You can also put the service name into your manifest.yml file so it automatically binds to the app, and still check that file into source control. If you were putting environment variables with sensitive info into your manifest.yml, you wouldn't want to check it into source control and you'd have to be a lot more careful with that file.
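As a sketch (the service and app names here are made up, and the credential values are placeholders), creating and binding a user-provided service from the CLI looks roughly like this:
# create a user-provided service holding the datasource settings
cf create-user-provided-service oracle-creds -p '{"url":"jdbc:oracle:thin:@//<host>:1521/<service>","username":"<username>","password":"<password>"}'
# bind it to the app and restage so VCAP_SERVICES is refreshed
cf bind-service my-app oracle-creds
cf restage my-app
And in manifest.yml the service can be listed so it binds automatically on push:
applications:
- name: my-app
  services:
  - oracle-creds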

how to start Apache Jena Fuseki as read only service (but also initially populate it with data)

I have been running an Apache Jena Fuseki server with a closed port for a while. At present my other apps can access it via localhost.
Following their instructions I start this service as follows:
./fuseki-server --update --mem /ds
This creates an updatable in-memory database.
The only way I currently know how to add data to this database is using the built-in HTTP request tools:
./s-post http://localhost:3030/ds/data
This works great, except now I want to expose this port so that other people can query the dataset. However, I don't want to allow people to update or change the database; I just want them to be able to query the information I originally loaded into it.
According to the documentation (http://jena.apache.org/documentation/serving_data/), I can make the database read-only by starting it without the update option.
Data can be updated without access control if the server is started
with the --update argument. If started without that argument, data is
read-only.
But when I start the database this way, I am no longer able to populate with the initial dataset.
So, MY QUESTION: How do I start an in-memory Fuseki database that I can populate with my original dataset but then disallow further HTTP updates?
(My guess is that I need another method of populating the Fuseki database that doesn't use the HTTP protocol, but I'm not sure.)
Here are some options:
1/ Use the TDB tools to build a database offline and then start the server read-only on that TDB database (see the command sketch below).
2/ Like (1) but use --update to build a persistent database, then stop the server, and restart without --update. The database is now read only. --update affects the services available and does not affect the data in any other way.
Having a persistent database has the huge advantage that you can start and stop the server without needing to reload data.
3/ Use a web server to pass query requests through to the Fuseki server, and have the Fuseki server listen only on localhost. You can update from the local machine; external people can't.
4/ Use Fuseki2 and adjust the security settings to allow update only from localhost but query from anywhere.
What you can't do is update a TDB database currently being served by Fuseki.
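For options 1 and 2, a rough command sketch would be the following; the database location and data file name are placeholders, and tdbloader ships with the Apache Jena command-line tools rather than the Fuseki distribution:
# build a persistent TDB database offline from your data
./tdbloader --loc=/path/to/DB mydata.ttl
# serve that database without --update, so it is query-only
./fuseki-server --loc=/path/to/DB /ds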

Web Farm + XML file for data storage

We have a web farm with two web servers. We are using an XML file (stored inside the web application) for data storage; we convert this XML file to Excel and then email the file whenever a new record is added to it. We didn't realize that, since we are on a web farm, we would end up with two different XML files, one on each server. Moreover, these files will contain different data.
Any suggestions on how we can handle this situation?
Ideas
Use a centralized database server to store the XML data.
Store the file in a location that is common to both servers - e.g. network attached storage (note this will require you to update your application logic to manage/prohibit concurrent write access to the file)
Write a merge routine that merges the two files together before the conversion / mailing process.
Be careful with #2 (the shared file location). You will be tempted to create a file share on server A that allows server B to access it, but then you will have created a single point of failure: if server A fails, server B is useless because it can't get to the XML file that lives on A.
We finally decided to use the IP address to redirect to one server all the time, so we will have a single XML file, which will be emailed as an Excel file.

Ideas on patterns for a data file based WCF REST web service

We have an existing, proprietary data processing application that runs on one of our servers and we wish to expose it as a web service for our clients to submit jobs remotely. In essence, the system takes a set of configuration parameters and one or more data files (the number of files depends on the particular configuration template, but the normal config is 2 files). The application then takes the input files, processes them, and outputs a single result data file (all files are delimited text / CSV or tab).
We want to now expose this process as a service. Based on our current setup and existing platforms, we are fairly confident that we want to go with WCF 4.0 as the framework and likely REST for the service format, though a SOAP implementation may be required at some point.
Although I am doing a lot of reading on SOA, WCF and REST, I am interested in other thoughts on how to model this service. In particular, the one-to-many relationship of job to required files for input. It seems pretty trivial to model a "job" in REST with the standard CRUD commands. However, the predefined "job type" parameter defines the number of files that must be included. A job type of "A" might call for two input files, while "B" requires 3 before the job can run.
Given that, what is the best way to model the job? Do I include the multiple files in the initial creation of the job? Or do I create a job and then have an "addFile" method whereby I can upload the necessary number of files?
The jobs will then have to run asynchronously because they can take time. Once complete, is it best to then just have a status field in the job object and require the client to regularly query the system for job status, or perhaps have the client provide a URL to "ping" when the job is complete?
We are only in the planning stages for the service, so any insights would be appreciated.
To model it for REST, think in terms of resources. Are the files part of the job resource, or are they separate resources?
If they are separate resources, then I would have a way to upload them separately. How they are linked is up to you: you could associate a file with a job when you upload the file, or you could create links (now treating links as individual resources too) between existing files and jobs.
If your files are not seen as separate resources, then I would include them inline with the job, as a single create.
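For the "separate resources" approach, one illustrative shape (the URIs below are examples only, not a prescription) might be:
POST /jobs                create the job with its configuration parameters (including job type)
POST /jobs/{id}/files     upload each required input file for that job
POST /jobs/{id}/start     kick off processing once the required number of files is present
GET  /jobs/{id}           poll the status field, or supply a callback URL at creation time
GET  /jobs/{id}/result    download the result file once the job is complete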
