Connectivity Introduction — Part I

Virginie Grandhaye
Feb 18, 2022 · 7 min read


As I recently celebrated my work anniversary, I’m proud to have been part of IBM for 13 years now.

IBM is a big company, and some argue that is a challenge in itself, but I see it the other way around: you have the opportunity to enjoy working on many different projects (I moved from Engineering to product management, across plenty of different products over those amazing years…), and all along the way I have met awesome people.

For the past two years, I have been working as Connectivity Product Manager for IBM’s Data Fabric, called #CloudPakForData. Let me introduce you to this great topic and propose a very humble, simplified overview.

Data

Let’s start with the basics…

The volume of data has been increasing exponentially for the past 10 years or so. Companies are investing heavily in storage, as they need to park data somewhere (and storage keeps getting cheaper). Data can be transient or operational, and you may decide to store it temporarily or forever (for various reasons, including legal considerations). But in any case, data storage is a cost center until you can extract value from the data…

Storage

Depending on its nature, data is stored in different kinds of systems: databases, data warehouses, data lakes…

Databases

Traditionally, data was stored in relational databases and accessed via SQL queries. Today, these are categorized as single-model databases (e.g. MySQL, MS SQL Server, MariaDB, PostgreSQL…). But a relational database is not appropriate for storing unstructured data (like a video, a song, or a document), hence the emergence of NoSQL databases and object stores (MongoDB, IBM Cloud Object Storage, AWS S3…).
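To make the relational model concrete, here is a minimal sketch using Python’s built-in sqlite3 module as a stand-in for any single-model relational database (the table and data are invented for illustration):

```python
import sqlite3

# In-memory SQLite database, standing in for any single-model
# relational database (MySQL, PostgreSQL, ...).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
)
conn.executemany(
    "INSERT INTO customers (name, country) VALUES (?, ?)",
    [("Alice", "FR"), ("Bob", "US")],
)

# Structured data fits naturally into rows and columns,
# and SQL is the access language.
rows = conn.execute(
    "SELECT name FROM customers WHERE country = 'FR'"
).fetchall()
print(rows)  # [('Alice',)]
```

A video or a document has no such row-and-column shape, which is exactly why object stores and NoSQL systems exist.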

The combination of both gives the so-called multi-model databases, able to store and access data in a relational way, but also as key-value pairs, graphs, documents, or even time series. E.g. Oracle, Cosmos DB, EnterpriseDB, MongoDB, SAP HANA…

Data Warehouse

The main use case for a data warehouse is probably building dashboards (Business Intelligence). Data is stored in an optimized layout so it can be retrieved with high performance, rendered in a graph or pie chart, and queried with the drill-down operations a business analyst needs.

E.g. IBM Netezza Performance Server, Teradata, Google BigQuery, Snowflake, Amazon Redshift…
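The drill-down pattern mentioned above can be sketched with a tiny in-memory fact table (again using sqlite3 as a stand-in; the regions and amounts are made up):

```python
import sqlite3

# Tiny in-memory "fact table", standing in for a warehouse schema.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales (region TEXT, month TEXT, amount REAL)")
dw.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EMEA", "Jan", 100.0), ("EMEA", "Feb", 150.0), ("AMER", "Jan", 200.0)],
)

# Top-level dashboard view: revenue per region.
by_region = dw.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Drill-down: the analyst clicks EMEA and sees the monthly breakdown.
emea_by_month = dw.execute(
    "SELECT month, SUM(amount) FROM sales WHERE region = 'EMEA' "
    "GROUP BY month ORDER BY month"
).fetchall()
print(by_region)      # [('AMER', 200.0), ('EMEA', 250.0)]
print(emea_by_month)  # [('Feb', 150.0), ('Jan', 100.0)]
```

A real warehouse runs the same kind of aggregation over billions of rows, which is why its storage layout is optimized for exactly these queries.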

Data Lake

Because of the variety of data and the emergence of data science as a way to get value out of data, companies wanted to make all their data available in a single place. A data lake’s value proposition is exactly that: gathering all types of data in one place to enable data science. To name a few actors, Cloudera, Hortonworks, and Azure all offer data lake solutions.

Data in motion

All of the above are static storage: you store the data, and unless you remove it, it stays forever. But with the emergence of IoT (Internet of Things) a while ago now, we also have data in motion, which has been considered a storage category of its own for a few years. A good analogy is the optic fiber connected to your house, which carries your internet data: you don’t necessarily need to know what is in the pipe, you cannot see the data, and if you could look inside, it would be different at two points in time (because it moves). Specific technologies handle this as a form of storage and take a different approach to how you can interact with the data.

E.g. Kafka, Amazon SQS, IBM MQ…
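The pipe analogy can be sketched with a toy producer/consumer pair. This is not Kafka’s or MQ’s actual API — just a bounded in-memory queue from the Python standard library playing the role of the stream, with invented IoT readings:

```python
import queue
import threading

# A bounded in-memory queue as a toy stand-in for a streaming system
# such as Kafka or IBM MQ: producers append, consumers drain, and the
# "pipe" holds different data at different points in time.
pipe = queue.Queue(maxsize=10)
received = []

def producer():
    for reading in [21.5, 21.7, 21.6]:  # e.g. IoT temperature readings
        pipe.put(reading)
    pipe.put(None)  # sentinel: end of stream

def consumer():
    while True:
        msg = pipe.get()
        if msg is None:
            break
        received.append(msg)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(received)  # [21.5, 21.7, 21.6]
```

The interaction model is the key difference from static storage: you subscribe to what flows through the pipe rather than query what sits at rest.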

Datasource

Whatever value you want to get from your data, all of the storage types mentioned above are considered potential datasources when it comes to integrating them with an application. In other words, you use those storage systems to access your data.

A datasource can be used as a source (you extract data from it) or as a target (you update its content with new data).
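The source/target distinction is simply the direction of the data flow. A minimal sketch, with two in-memory SQLite databases standing in for two real datasources (table names invented):

```python
import sqlite3

# One database plays the source datasource (we extract from it),
# the other plays the target (we load into it).
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

source.execute("CREATE TABLE orders (id INTEGER, total REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.99), (2, 25.0)])

target.execute("CREATE TABLE orders_copy (id INTEGER, total REAL)")

# Extract from the source...
rows = source.execute("SELECT id, total FROM orders").fetchall()
# ...and load into the target.
target.executemany("INSERT INTO orders_copy VALUES (?, ?)", rows)

count = target.execute("SELECT COUNT(*) FROM orders_copy").fetchone()[0]
print(count)  # 2
```

The same system can of course play both roles in different pipelines.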

I created this mind map to help navigate the overall complexity of datasources and highlight some of the capabilities we have today in Cloud Pak for Data. It is neither an exhaustive list of what exists on the market nor the full list of what is available in Cloud Pak for Data, but it shows the high-level categories, to help you understand some of the complexity of connectivity.

Datasources overview


If you want to know more about Cloud Pak for Data connections, please visit our documentation:

Cloud Pak for Data on-premises

https://www.ibm.com/docs/SSQNUZ_4.0/cpd/access/data-sources.html

Cloud Pak for Data as a Service

https://dataplatform.cloud.ibm.com/docs/content/wsj/manage-data/conn_types.html

Authentication

As mentioned, data is stored, and it needs to be protected (you wouldn’t want everyone to have access to your bank account). Hence, on any type of storage, the administrator defines who is, or is not, allowed to connect to the data. There are plenty of different authentication types. It can be as simple as each user having a “user name” and “password”, specific to the database and created only for that purpose. But most of the time, the so-called ‘credentials’ are administered in a central place (LDAP, Vault, …) and can take the form of an encrypted key or token.

You may also have heard about federated authentication, or SSO (which stands for Single Sign-On): the notion of reusing credentials initially defined for one system to authenticate against other systems (like when you use your Facebook or Gmail credentials to connect to another application). In this case, authentication works thanks to a token that passes the authorization to connect through a third-party application.

Slightly different again, you can sometimes use an API key. A good analogy is the physical key you use to open a door: when you book a hotel room and arrive at the hotel, they assign you a key to enter your room, and the key’s validity depends on the duration of your stay. An API key gives access to a system and can be retrieved either permanently or for a limited period of time. Note that some systems have “rotating” key management in place, meaning the key is valid for only 5, 10, or 30 minutes, or 24 hours… and each time you try to connect, you need to generate a new key.
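The rotating-key idea can be sketched in a few lines. This is a toy model, not any vendor’s real key-management API: the function names and the 30-minute TTL are invented for illustration:

```python
import secrets
import time

# Toy sketch of a rotating API key: the server issues a key with a
# short time-to-live, and the client must request a fresh one once
# it expires. Names and TTL are illustrative only.
KEY_TTL_SECONDS = 30 * 60  # e.g. a 30-minute rotation window

issued = {}  # key -> expiry timestamp

def issue_key(now=None):
    now = time.time() if now is None else now
    key = secrets.token_urlsafe(32)  # cryptographically random key
    issued[key] = now + KEY_TTL_SECONDS
    return key

def is_valid(key, now=None):
    now = time.time() if now is None else now
    return key in issued and now < issued[key]

key = issue_key(now=0)
assert is_valid(key, now=60)                       # still within the window
assert not is_valid(key, now=KEY_TTL_SECONDS + 1)  # expired: get a new key
```

From the client’s point of view, expiry simply means catching the rejection and asking for a fresh key before retrying the connection.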

Connectivity

This is a generic term covering everything involved in establishing a communication channel between a datasource and an application (in my case, our Cloud Pak for Data platform): setting up the appropriate security and network requirements and managing permissions. Any data fabric needs to be connected to the data before you can consider getting value out of it. To establish this communication, we usually use small utilities called connectors.

Connector

A connector is a piece of software that allows connecting to a datasource. Because of the variety of data types and data storages, we usually have one connector per datasource type. Without getting too technical, a connector usually leverages either an API exposed by the datasource, or a driver. A driver is a piece of software, usually provided by the datasource vendor, that embeds all the capabilities needed to interact with the datasource. That can be as simple as managing the tables in a database, deciding on partitioning, or leveraging specific authentication options (like generating an API key…).
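The “one connector per datasource type” idea can be sketched as a registry mapping a type name to a connector class. Here the “driver” is Python’s built-in sqlite3 module, standing in for a vendor-provided JDBC/ODBC driver; the class and type names are invented:

```python
import sqlite3

# One connector class per datasource type, looked up in a registry.
class SQLiteConnector:
    def __init__(self, location):
        # Delegate the low-level work to the driver (sqlite3 here).
        self._conn = sqlite3.connect(location)

    def query(self, sql):
        return self._conn.execute(sql).fetchall()

CONNECTOR_REGISTRY = {"sqlite": SQLiteConnector}

def connect(datasource_type, location):
    try:
        connector_cls = CONNECTOR_REGISTRY[datasource_type]
    except KeyError:
        raise ValueError(f"no connector for datasource type {datasource_type!r}")
    return connector_cls(location)

c = connect("sqlite", ":memory:")
result = c.query("SELECT 1 + 1")
print(result)  # [(2,)]
```

A real platform registers one such class per supported datasource, each wrapping that vendor’s driver or API behind the same interface.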

In Cloud Pak for Data, some connectors are built natively for a datasource (and are named after the datasource type they deal with, like Db2, Greenplum, or SAP HANA), while others are tied to a specific technology (like OData, Hive, HDFS, or Generic JDBC). We have a third category of connectors that build the glue between the different services deployed on the platform and ensure interoperability between them. Indeed, Cloud Pak for Data covers use cases ranging from data virtualization and data integration to data science and business intelligence. Say you’ve prepared and integrated your data within one service and want to reuse it for a different team or project in another service: we allow you to create a connection to that service (like Data Virtualization, Analytics Engine, …). Some of these connect directly to services available and provisioned on IBM Cloud (Cloud Object Storage, Db2, …), or even on other cloud vendors like Azure or Amazon.

The complete set of properties of our Cloud Pak for Data connectors can be seen here:

https://api.dataplatform.cloud.ibm.com/v2/data_flows/doc/dataasset_and_connection_properties.html

Connected Data

In Cloud Pak for Data, once you’ve established a connection to a datasource, you’re able to retrieve data (tables, most of the time). This data is called ‘connected data’. This asset type is specific to Cloud Pak for Data and is the combination of a data table with a connection.
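Conceptually, a connected data asset is just a pairing of a connection definition with a reference to a table inside that datasource. The sketch below is purely hypothetical — the field names are invented for illustration and are not the Cloud Pak for Data asset schema:

```python
from dataclasses import dataclass

# Hypothetical model: a "connected data" asset pairs a connection
# with a reference to a table reachable through that connection.
@dataclass(frozen=True)
class Connection:
    name: str
    datasource_type: str
    host: str

@dataclass(frozen=True)
class ConnectedData:
    connection: Connection
    schema: str
    table: str

    def qualified_name(self):
        return f"{self.connection.name}/{self.schema}.{self.table}"

conn = Connection("my-warehouse", "netezza", "dw.example.com")
asset = ConnectedData(conn, "SALES", "ORDERS")
print(asset.qualified_name())  # my-warehouse/SALES.ORDERS
```

The point is that the asset stores no data itself: it records where and how to fetch the table on demand.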

To learn more about the asset types available in Cloud Pak for Data, you can visit this page:

https://dataplatform.cloud.ibm.com/docs/content/wsj/getting-started/assets.html

My job as a product manager at IBM is to prioritize the investments we make in connecting Cloud Pak for Data to the various datasources, and I hope you now understand that there are hundreds of possible combinations…

Market research, customer satisfaction, technical deep dives, technology watch… that is my day-to-day job, and I love it since I never get bored.
