Can we compare the contents of two folders in spotfire? - unix

I have two environments. One is development and another is production. Lets say I have folder in production which has all my metadata like ILs, joins, DS, Analysis, scripts etc. Now in development I have the same folder but with new enhancements done.
Now, I want to compare that what are the changes that have been done and as per the result I will be able to understand the impact.
So, could you please tell me that what is way to compare that two folders of development and production environment?

For the requirements posted here, you can create information link on top of LIB_ITEMS table to fetch details of library item details from Spotfire database. An another set of activity is performed at this link, but approach can be used for your requirements as well.

Related

Version control of big data tables (iceberg)

I'm building a Iceberg tables on the top of a data lake. These tables are used for reporting tools. I'm trying to figure out what is the best way to control a version/deploy changes to these tables in CI/CD process. E.g. I could like to add a column to the Iceberg table. To do that I have to write a ALTER TABLE statement, save it to the git repository and deploy via CI/CD pipeline. Tables are accessible via AWS Glue Catalog.
I couldn't find to much info about this in google so if anyone could share some knowledge, it would be much appreciated.
Cheers.
Version control of Iceberg tables.
Agree with #Fokko Driesprong. This is a supplement only.
Sometimes, table changes are considered as part of task version changes. That is, table change statements, ALTER TABLE, are bound to task upgrades.
Tasks are sometimes automatically deployed. So it often executes a table change statement first, and then deploys a new task. If the change is disruptive, then we need to stop the old task first and then deploy the new one.
Corresponding to the upgrade, we also have a rollback script, of course, the corresponding table change statement.
thanks for asking this question. I don't think there is a definitive way of doing this. In practice I see most people bundling this as part of the job that writing to the Iceberg table. This way you can make sure that new columns are populated right away with the new version of the job. If you don't do any breaking changes (such as deletion of column), then the downstream jobs won't break. Hope this helps!

Duplicate column issue in migration scripts in .net microservice module. I have to resolve duplicate column in my migration which already executed

In .net core microservices, another team works on open source modules, and I extend their modules in my project. I already added one column in a Entity and then same column is added by open source team. Now duplicate column error is showing.
I can not alter open source migration files and my column is already in production.
How to resolve this issue please suggest.
i think we are dealing with multiple smells here.
each microservice should have its own Datarealm
by the sound of it you extended a Table of an object which was generated by the opensource service
But this does not help you, the question is can you merge the Data which is in this column.
If so you might get away with a simple copy/update.
What you need todo is:
Create a new Column With a different Name
Copy your existing Data into the new column
Drop your column
Execute your Migration
Copy your Data back into the column
Test your application very carefully, if the logic which you have implemented for the from you generated Data works as intended
Drop your backup column
Depending on the amount of Data this will lead to a downtime, so plan ahead and have a rollback strategy ready if something goes wrong.
Personal Opinion to Prevent those Smells
Every time i needed an opensource project in my projects in the past i wrote a wrapper around it, this has multiple benefits.
For one if the api of the project changes you only have to update it in exactly one place, which improves the maintainability.
Because it has a wrapper it automatically gets a own Database and if i need to extend an entity which i get from one of those opensource projects, i usually do it via Foreign Keys with a different Table. Which then gets linked via a view.
Yes this costs some performance, but in the end it was worth it every time.

What is the best practice for transferring objects across R projects?

I would like to use R objects (e.g., cleaned data) generated in one git-versioned R project in another git-versioned R project.
Specifically, I have multiple git-versioned R projects (that hold drake plans) that do various things for my thesis experiments (e.g., generate materials, import and clean data, generate reports/articles).
The experiment-specific projects should ideally be:
Connectable - so that I can get objects (mainly data and materials) that I generated in these projects into another git-versioned R project that generates my thesis report.
Self-contained - so that I can use them in other non-thesis projects (such as presentations, reports, and journal manuscripts). When sharing such projects, I'd ideally like not to need to share a monolithic thesis project.
Versioned - so that their use in different projects can be independent (e.g., if I make changes to the data cleaning for a manuscript after submitting the thesis, I still want the thesis to be reproducible as it was originally compiled).
At the moment I can see three ways of doing this:
Re-create the data cleaning process
But: this involves copy/paste, which I'd like to avoid, especially if things change upstream.
Access the relevant scripts/functions by changing the working directory
But: even if I used here it seems that this would introduce poor reproducibility.
Make the source projects into packages and make the objects I want to "export" into exported data (as per the data section of Hadley's R packages guide)
But: I'd like to avoid the unnecessary metadata, artefacts, and noise (e.g., see Miles McBain's "Project as an R package: An okay idea") if I can.
Is there any other way of doing this?
Edit: I tried #landau's suggestion of using a single drake plan, which worked well for a while, until (similar to #vrognas' case) I ended up with too many sub-projects (e.g., conference presentations and manuscripts) that relied on the same objects. Therefore, I added some clarifications above to my intentions with the question.
My first recommendation is to use a single drake plan to unite the stages of the overall project that need to share data. drake is designed to handle a lot of moving parts this way, and it will be more seamless when it comes to drake's decisions about what to rerun downstream. But if you really do need different plans in different sub-projects that share data, you can track each shared dataset as a file_out() file in one plan and track it with file_in() in another plan.
upstream_plan <- drake_plan(
export_file = write_csv(dataset, file_out("exported_data/dataset.csv"))
)
downstream_plan <- drake_plan(
dataset = read_csv(file_in("../upstream_project/exported_data/dataset.csv"))
)
You fundamentally misunderstood Miles McBain’s critique. He isn’t saying that you shouldn’t write reusable code nor that you shouldn’t use packages. He’s saying that you shouldn’t use packages for everything. But reusable code (i.e. code that you want to reuse) absolutely belongs in packages (or, better, modules), which can then be used in multiple projects.
That being said, first off, pay attention to Will Landau’s advice.
Secondly, you can make your RStudio projects configurable such that they can load data based on paths given in a configuration. Once that’s accomplished, nothing speaks against hard-coding paths to data in different projects inside that config file.
I am in a similar situation. I have many projects that are spawned from one raw dataset. Previously, when the project was young and small, I had it all in one version controlled project. This got out of hand as more sub-projects were spawned and my git history got cluttered from working on projects in parallel. This could be to my lack of skills with git. My folder structure looked something like this:
project/.git
project/main/
project/sub-project_1/
project/sub-project_2/
project/sub-project_n/
I contemplated having each project in its own git branch, but then I could not access them simultaneously. If I had to change something to the main dataset (eg I might have not cleaned some parts) then project 1 could become outdated and nonfunctional. Once I had finished project 1, I would have liked it to be isolated and contained for reproducibility. This is easier to achieve if the projects are separated. I don't think a drake/targets plan would solve this?
I also looked briefly into having the projects as git submodules but it seemed to add too much complexity. Again, my git ignorance might shine through here.
My current solution is to have the main data as an R-package, and each sub-project as a separate git-versioned folder (they are actually packages as well, but this is not necessary). This way I can load in a specific version of the data (using renv for package versions).
My folder structure now looks something like this:
main/.git
sub-project_1/.git
sub-project_2/.git
sub-project_n/.git
And inside each sub-project, I call library(main) to load the cleaned data. Within each sub-project, a drake/targets plan could be used.

Association of any kind inbetween QC projects?

I want to associate one QC project with another (e.g., manual testing and automation testing). I use QC 11.00
I would like to know what kind of association there can be between two QC projects (on the same domain), so I do not have to maintain two projects and then copy paste what I need e.g. common repositories etc.
I'm not sure that you can do this. A project in QC is supposed to be a self-contained entity, that is, there is no way (that I know of) that you can automatically move data between projects.
Sure, you can copy and paste data, as well as create a project with another one as base, but that is probably not what you want.
I would rather have manual testing and automation in the same project, which makes more sense I think. The point is that the project is supposed to identify the test object, rather than the test methodology - the latter can be done better in Test Plan where you specify a Test Type when you create your test.
This way, you will have all defects and test reports for your test object in the same project which will make it all the easier to track what is going on.
As a general rule; you would want to keep all project data for one project in that project; and, you want project data from that project to be unique and separate from all other projects.
That being said... if you really wanted to do this (and were able to convince a QC subject matter expert that it was a good idea?), then it should be a relatively simple matter to amend the workflow with additional code to interface with another project.

I'm working with Drupal and I want to diff between two databases, to see which tables have been updated in a given session. how do I do this?

I'm working with Drupal on a project, trying to find a way to speed up tests (we're using Cucumber and Selenium), and I'm trying to see which tables have been changed in a given series of steps, so I can just revert dump and out reset those tables between each test case.
Right now, Simpletest, the Drupal testing framework works by installing and setting up the tables for every module needed for a test, which makes for slow tests, and I'm emulating a similar approach by loading a db dump for each test.
Given that a site, if you're doing integration testing has a 'known good' state to be starting from, I think it would be faster to be able to just revert back to that point each time, instead of waiting twenty seconds or so to drop the database then pipe the dumpfile back in between each test runs.
However, when I try diffing between two dumpfiles (ie before.I.create.a.node.sql, and after.I.create.a.node.sql) the output is an unreadable load of serialised php, that I can't make sense of.
Ae there any tools I can use to help work out which tables I need to drop and rebuild between test cases, so I don't incur the 20 second hit on each test, short of reading the schema and code of every module I'm working with?
I'm following the ideas outlined here with getting cucumber to work with PHP, and yes, I have seen [this question here on a similar subject
Thanks!
Drupal does store a lot of serialized PHP in the database. But the main part of it is kept in the cache tables; like cache, cache_field, cache_menu, etc and you can safely truncate these before dumping the database.
If you have any simpletest tables you could drop those. They are all temporary and is used only for running your Simpletest test suite.
That should reduce the dump size a lot. If it's not enough I can recommend reading up on the tables in the book Pro Drupal Development, or you could skim through the .install files to read the module's schema definitions. Though most will probably be real data you'd want to revert between tests.
Because of the relational nature of the database, be sure to either know exactly what you're doing or dump/revert all the remaining tables together.

Resources