Is it possible to connect to SAS Data Warehouse (DB2) using R?

Currently, using SAS Enterprise Guide, we have some code that pulls data from a data warehouse with a seemingly straightforward 'CONNECT TO DB2 (x, y, z)' statement in a PROC SQL block, where x = database name, y = user ID, and z = password.
Looking in Tools > Connections > Profiles, I can find the host name and port for the connection. I'm trying to see if there's a way to use this information (and find any other needed information) to connect to the same data warehouse using R. Other posts here on SO have some code using JDBC or RODBC, but I wasn't able to get anywhere with that, as they mention drivers (which I don't know anything about) and the Oracle folder in my C drive didn't seem to have any of that information.
After reaching out to someone involved with the warehouse, their response was "it is a SAS data warehouse and not accessible via direct ODBC connections."
I'm not too sure what other information to ask for or provide for this, but do any of you know if what I'm looking to do is possible? Are SAS Data Warehouses structured some way that would prevent me from accessing it in R? If I can, what else do I need besides the host name, port, database name, user ID, and password?
Thanks in advance, I'm pretty new to all of this
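For reference, the kind of connection I was hoping to make from R is sketched below, using the DBI and odbc packages. The driver name, port, and connection keywords are guesses: they assume an IBM DB2 ODBC driver is installed and that the warehouse actually exposes a DB2 listener at the host/port from the SAS profile, which (per the response above) may not be the case.
# Hypothetical DB2 connection attempt via DBI + odbc; driver name,
# host, port, and credentials below are placeholders
library(DBI)
library(odbc)
con <- dbConnect(odbc::odbc(),
                 Driver   = "IBM DB2 ODBC DRIVER",  # whatever name the installed driver registers under
                 Hostname = "host.from.sas.profile",
                 Port     = 50000,                  # port shown in Tools > Connections > Profiles
                 Database = "x",                    # database name from CONNECT TO DB2
                 UID      = "y",
                 PWD      = "z")
# Quick sanity check, then clean up
dbGetQuery(con, "SELECT * FROM some_schema.some_table FETCH FIRST 10 ROWS ONLY")
dbDisconnect(con)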

Related

Kafka Connector for Oracle Database Source

I want to build a Kafka connector to retrieve records from a database in near real time. My database is Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 and the tables have millions of records. First of all, I would like to put the minimum possible load on my database by using CDC. Secondly, I would like to retrieve records based on a LastUpdate field whose value is after a certain date.
Searching the Confluent site, the only open-source connector I found was the "Kafka Connect JDBC" connector. I don't think this connector has a CDC mechanism, and it isn't practical to retrieve millions of records when the connector starts for the first time. The alternative I considered is Debezium, but there is no Debezium Oracle connector on the Confluent site, and I believe it is still in beta.
Which solution would you suggest? Is anything wrong with my assumptions about Kafka Connect JDBC or the Debezium connector? Is there any other solution?
For query-based CDC, which is less efficient, you can use the JDBC source connector.
For log-based CDC I am aware of a couple of options; however, some of them require a license:
1) Attunity Replicate, which lets users build real-time data pipelines from producer systems into Apache Kafka through a graphical interface, without any manual coding or scripting. I have been using Attunity Replicate for Oracle -> Kafka for a couple of years and have been very satisfied.
2) Oracle GoldenGate, which requires a license.
3) Oracle LogMiner, which does not require any license and is used by both Attunity and kafka-connect-oracle, a Kafka source connector that captures all row-based DML changes from an Oracle database and streams them to Kafka; its change data capture logic is based on Oracle LogMiner.
We have numerous customers using IBM's IIDR (InfoSphere Data Replication) product to replicate data from Oracle databases (as well as IBM Z mainframes, IBM i, SQL Server, etc.) into Kafka.
Regardless of which source is used, data can be normalized into one of many formats in Kafka. An example of an included, selectable format is:
https://www.ibm.com/support/knowledgecenter/en/SSTRGZ_11.4.0/com.ibm.cdcdoc.cdckafka.doc/tasks/kcopauditavrosinglerow.html
The solution is highly scalable and has been measured replicating changes at rates in the hundreds of thousands of rows per second.
We also have a proprietary ability to reconstitute data written in parallel to Kafka back into its original source order. So, despite data having been written to numerous partitions and topics, the original total order can be known. This functionality is known as the TCC (transactionally consistent consumer).
See the video and slides here...
https://kafka-summit.org/sessions/exactly-once-replication-database-kafka-cloud/

Efficient method to move data from Oracle (SQL Developer) to MS SQL Server

Daily, I query a few tables in SQL Developer, filtering to the prior day's activity, adding a column to date-stamp the data, then exporting to .xlsx. Then I manually import each file into MS SQL Server via the SQL Server Import and Export Wizard. Takes many clicks, much waiting...
I'm essentially creating an archive in SQL Server; the application I'm querying overwrites its data daily. I'm not a DBA of either database; I use the archived data to do validations and research.
It's tough to get my org to provide additional software, so I've been trying to make this work with SQL Developer, SSMS Express edition, and other standard tools.
I'm looking to make this reasonably automated, either via scripts, scheduled tasks, etc. I'd appreciate suggestions that would work in my current situation, but if that isn't reasonable and there's a very reasonable alternative, I can go back to the org to request software/access/assistance.
You can use SSIS to import the data directly from Oracle to SQL Server, unless you need the .xlsx files for another purpose. If you do need the files, you can export from Oracle to .xlsx and then load them into SQL Server from those files.
For the date-stamp column, a Derived Column can be added within a Data Flow Task using the SSIS GETDATE() function. GETDATE() returns a timestamp; if only the date is needed, a (DT_DBDATE) cast will convert it to a date type compatible with SQL Server's date data type.
Once you have the SSIS package configured, you can schedule it to run at regular intervals as a SQL Server Agent job. I'd also recommend installing the SSIS catalog (SSISDB) and using it as the source to run the packages from. The following links shed more light on these areas.
SSIS
Connecting to Oracle from SSIS
Data Flow Task
Derived Column Transformation
Creating SQL Server Agent Jobs for SSIS packages
SSIS Catalog
Another option that you may consider (if it is supported in SQL Server Express) is the BCP utility, which can be run from the command line.
The BCP utility allows you to bulk copy the data from a delimited text file into a SQL Server table.
If you take this approach, things to consider:
The number of columns in the source file needs to match the number of columns in the destination table.
Data types must match (or be compatible).
Typically, empty strings will be converted to nulls, so you will need to consider whether the columns are nullable.
(To name a few - if you want to delve deeper, you might also need to look at custom delimiters between fields and records. Don't forget, commas and line feeds are still valid characters in char-type fields.)
Anyhow, maybe it will work for you, maybe not. Sure, you might still have to deal with exporting the data from Oracle, but it might ease the pain of getting the data in.
Have a read:
https://learn.microsoft.com/en-us/sql/tools/bcp-utility?view=sql-server-2017

r - SQL on large datasets from several access databases

I'm working on a process improvement that will use SQL in R to work with large datasets. Currently the source data is stored in several different MS Access databases. My initial approach was to use RODBC to read all of the source data into R, and then use sqldf() to summarize the data as needed. I'm running out of RAM before I can even begin to use sqldf(), though.
Is there a more efficient way for me to complete this task using r? I've been looking for a way to run a SQL query that joins the separate databases before reading them into r, but so far I haven't found any packages that support this functionality.
Should your data be in a database, dplyr (part of the tidyverse) would be the tool you are looking for.
You can use it to connect to a local / remote database, push your joins / filters / whatever there and collect() the result as a data frame. You will find the process neatly summarized on http://db.rstudio.com/dplyr/
What I am not quite certain of - but it is not a R issue but rather an MS Access issue - is the means for accessing data across multiple MS Access databases.
You may need to write custom SQL code for that & pass it to one of the databases via DBI::dbGetQuery() and have MS Access handle the database link.
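A minimal sketch of that workflow, assuming an odbc connection to a single Access file and made-up table/column names (it requires the dbplyr package and the Access ODBC driver; the driver string and path are placeholders, and Access's SQL dialect may reject some generated queries):
# Sketch: dplyr on top of a DBI/odbc connection to an Access file
library(DBI)
library(odbc)
library(dplyr)   # dbplyr must also be installed for the database backend
con <- dbConnect(odbc::odbc(),
                 .connection_string = paste0(
                   "Driver={Microsoft Access Driver (*.mdb, *.accdb)};",
                   "Dbq=C:/Documents/Name_Of_My_Access_Database.accdb;"))
# Build the query lazily so filtering/aggregation happens in the database,
# and only the summarized result is pulled into RAM
result <- tbl(con, "Name_of_table_in_my_database") %>%
  group_by(some_group_column) %>%
  summarise(row_count = n(), total = sum(some_value, na.rm = TRUE)) %>%
  collect()
dbDisconnect(con)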
The link you posted looks promising. If it doesn't yield the intended results, consider linking one Access DB to all the others. Links take almost no memory. Union the links and fetch the data from there.
# Load the RODBC package
library(RODBC)
# Connect to the Access database; R's bitness must match the installed
# Access ODBC driver (often 32-bit). Use odbcConnectAccess2007() for .accdb files.
channel <- odbcConnectAccess("C:/Documents/Name_Of_My_Access_Database")
# Pull the data with a plain SQL query
data <- sqlQuery(channel, "select * from Name_of_table_in_my_database")
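Building on the linked-table suggestion above, the unioning can be pushed into Access itself; a rough sketch with made-up table names and file paths (Access/Jet SQL also allows referencing another database file directly via an IN clause):
# Sketch only: table names and file paths are made up
# Union linked tables inside the one Access database that links to the others
combined <- sqlQuery(channel, "
  SELECT * FROM linked_table_from_db1
  UNION ALL
  SELECT * FROM linked_table_from_db2
")
# Alternatively, query a second database file directly with Access's IN clause
other <- sqlQuery(channel,
  "SELECT * FROM Name_of_table IN 'C:/Documents/Another_Access_Database.mdb'")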
These URLs may help as well.
https://www.r-bloggers.com/getting-access-data-into-r/
How to connect R with Access database in 64-bit Window?

How to validate data in Teradata from Oracle

My source data is in Oracle and my target data is in Teradata. Can you please suggest an easy and quick way to validate the data? There are 900 tables. If possible, can you provide syntax too?
There is a product available known as the Teradata Gateway that works with Oracle and allows you to access Teradata in a "heterogeneous" manner. This may not be the most effective way to compare the data.
Ultimately, your requirements sound more process driven; to do this effectively, the source data would need to be compared/validated in stage tables in the Teradata environment after your ETL/ELT process has completed.

Pull Sybase data into SQL Server

I have an ASP.NET app that uses a SQL Server database. I now need to pull data from Sybase ASE into that SQL Server database for my app to consume, and I'm not having any success with my ideas.
Has anyone done this? Any ideas/suggestions/tips?
You can configure a linked server from SQL Server to Sybase. It should be fairly vanilla using the Sybase provider on the MS side.
Okay, I've finally (through lame trial and error) found out how to link my Sybase ASE (12.5) server to my SQL Server (2008) which will allow the integration I want. Here's roughly how I did it:
Logged in to Sybase ASE OLE DB Configuration Manager (this is like the Sybase version of Windows' ODBC Data Sources) and added an OLE DB data source. I believe you must be an admin on the PC to do this.
In SQL Server 2008 Management Studio, went to Server Objects > Linked Servers. Right click and select "New Linked Server".
In the Linked Server Properties, I set the following properties:
General:
--Linked server: the name of your linked server as you want it to appear in your linked server list
--Provider: Select Sybase ASE OLE DB Provider from the dropdown list.
--Product name: The exact name of the OLE DB data source you just created in Sybase ASE OLE DB Configuration Manager.
--Data source: Same as Product name.
--Provider string: I left this blank
--Location: I left this blank
--Catalog: The default database (master or whatever) to log on to.
Security:
--You need to map a valid SQL Server logon to a valid Sybase logon. I did not use impersonation (which does a credentials pass-thru).
--For the connection, I chose "Be made without using a security context".
Server Options:
--All the defaults worked for me.
Throughout, the standard SQL Server help worked fairly well as a guide. Though not always true, F1 was my friend here.
I can now do distributed queries, DTS or SSIS packages, and use SSRS. This takes a lot of the suck out of Sybase ASE.
Of course the above can be done via the command line using sp_addlinkedserver, but the GUI is more comfortable for a lowly dev like me.
Use Management Studio or Enterprise Manager to import the data using the data import wizard. That should be it; just make sure you pick the right data provider in the wizard and you should be good to go.
If you want this to be a live feed, create a small Windows service to manage the exchange of information. It should be relatively simple to do, just a little bit of legwork on your end. If you are averse to that, there are plenty of off-the-shelf solutions that can do this for you.
The question is a little vague on specifics:
Is this a one-time conversion or part of a repeated process?
Is the source machine "reachable" from your destination machine (can you connect the two, or do you need to read in files)?
With most conversions there are two parts:
Physically getting data from the source into the destination.
Mapping data from the source to the destination tables.
It is hard to make any recommendations without more info. What would be fine for a one-time conversion would not work if you need to read in data all day, every day. Also, if the source database cannot be connected to and you have to pass files, the methods change.
