
Sunday, April 13, 2025

Anti Patterns for Data Integration Hub

CIOs across enterprises run various applications that generate and transform large volumes of data. The data thus generated cannot remain in silos. Applications need to be integrated to pass data across and allow collaboration among cross-functional teams. The industry has devised various designs over time to integrate such applications. One popular architecture style is Hub and Spoke, which integrates applications through a centralized hub.

Many patterns have evolved over time to implement this design, and most of them are widely followed. With this article, I'd like to throw some light on the anti-patterns to raise awareness. These are some strict "Don'ts" while integrating applications through a centralized hub.

Using the hub as a surrogate for existing data stores: Applications and databases in an enterprise serve different purposes. Individual databases are set up for handling operational or analytical data. The Integration Hub (IH) should not try to emulate such existing applications or data stores. The main purpose of the IH should remain simply moving data to destination applications. The IH should not take over a spoke's responsibilities, which would require domain expertise.

Forming tight coupling with source or destination systems: One of the main objectives in an enterprise application landscape is to develop loosely coupled systems. Whether it is connections, data formatting, or the protocols involved in integration, each area should be independent, modular, and loosely coupled. Loosely coupled systems are easy to scale, remain flexible, and stay interoperable. Adopting event-driven architecture is one of the best ways to make systems loosely coupled. These advantages in turn bring monetary benefits.

Hard-coded integration: Integration logic or configuration hard-coded into the systems might provide near-term convenience. Embedding configuration in code seems easier and faster, but it ends up creating a rigid and inflexible integration that is difficult to change or reuse. For example, if the integration requirements or rules change, the code has to be modified and redeployed, which can introduce errors and downtime. To avoid this anti-pattern, use an externalized and declarative integration approach that separates the integration logic and configuration from the code and stores them in a configuration file or a database.
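As a minimal illustration of externalizing configuration, the sketch below loads routing rules from a configuration file instead of embedding them in code. The file name, event types, and rule fields are all hypothetical assumptions, not a prescribed format.

```python
# Minimal sketch of externalized integration configuration.
# "integration_routes.yaml" and its contents are hypothetical.
import yaml  # pip install pyyaml


def load_routes(path="integration_routes.yaml"):
    """Load routing and mapping rules from a config file instead of code."""
    with open(path) as f:
        return yaml.safe_load(f)


def route(event, routes):
    """Pick destination endpoints for an event based on declarative rules."""
    rule = routes.get(event["type"], {})
    return rule.get("destinations", [])


# Example integration_routes.yaml:
#   order.created:
#     destinations: [billing-service, warehouse-service]
#     format: json
#
# Usage:
#   routes = load_routes()
#   route({"type": "order.created", "payload": {}}, routes)
#   -> ["billing-service", "warehouse-service"]
```

Changing a destination or adding a new event type then becomes a configuration change rather than a code redeployment.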

Insufficient data validation: Not validating data at an early stage as it passes through can eventually build up data sync issues across multiple systems. Even with event-driven architecture, event schemas that lack validation make integration systems more vulnerable to errors that surface during development. More often than not, the applications to be integrated turn out to be heterogeneous.
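A lightweight way to enforce this is schema validation at the hub boundary. The sketch below uses the jsonschema library with an illustrative event schema; the event type and its fields are assumptions made for the example.

```python
# Sketch: validate incoming events against a declared schema before routing.
# The event type and schema fields are illustrative, not from a specific system.
from jsonschema import ValidationError, validate  # pip install jsonschema

CUSTOMER_UPDATED_SCHEMA = {
    "type": "object",
    "properties": {
        "customer_id": {"type": "string"},
        "email": {"type": "string"},
        "updated_at": {"type": "string"},
    },
    "required": ["customer_id", "updated_at"],
}


def accept_event(event):
    """Reject malformed events at the hub boundary instead of downstream."""
    try:
        validate(instance=event, schema=CUSTOMER_UPDATED_SCHEMA)
        return True
    except ValidationError as err:
        # In a real hub this would go to an error channel and notify the source.
        print(f"Rejected event: {err.message}")
        return False


accept_event({"customer_id": "C-1001"})  # False: missing updated_at
```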

Not using event/message replay: Quite often, a receiving system needs data from the past to catch up after unintended delays. When a replay feature is unavailable, republishing data from a certain point becomes a huge operational effort.
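If the hub happens to be backed by a log-based broker such as Kafka, replay can be as simple as seeking a consumer back to a point in time. The sketch below uses the kafka-python client; the topic, partition, group id, and broker address are illustrative assumptions.

```python
# Sketch: replay events from a point in time with the kafka-python client.
# Topic, partition, group id, and broker address are illustrative assumptions.
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id="target-app-replay",
    enable_auto_commit=False,
)

tp = TopicPartition("customer-events", 0)
consumer.assign([tp])

# Find the offset closest to the time the receiving system fell behind.
replay_from_ms = 1_700_000_000_000  # epoch millis, illustrative
offsets = consumer.offsets_for_times({tp: replay_from_ms})
if offsets.get(tp) is not None:
    consumer.seek(tp, offsets[tp].offset)

for record in consumer:
    # Re-deliver each past event to the receiving application here.
    print(record.offset, record.value)
```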

Lacking a data lineage mechanism: Data lineage provides a detailed map of how data is ingested, transformed, and activated across the data pipeline. Without it, root cause analysis during incident resolution becomes harder. A complete data lineage implementation should include details about data sources, transformations, destinations, metadata, and the dependencies between different data elements. This mechanism not only helps in tracing but also improves data quality.

Not monitoring key metrics: Integration systems should be planned to measure key metrics in real time. This helps in measuring data volumes and assessing trends over time. Sudden spikes or drops in volume are usually symptoms of failed components. Detecting anomalies with proactive monitoring helps save cost and reputation.
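One possible shape for such monitoring is a rolling baseline over recent event counts that flags sudden spikes or drops. The sketch below is a simplified illustration; the window size and deviation threshold are arbitrary assumptions.

```python
# Sketch: track per-interval event volume and flag sudden spikes or drops.
# The window size and deviation threshold are arbitrary assumptions.
from collections import deque


class VolumeMonitor:
    def __init__(self, window=60, deviation=0.5):
        self.history = deque(maxlen=window)  # recent per-interval counts
        self.deviation = deviation           # allowed fraction around baseline

    def record(self, count):
        """Record one interval's count and alert if it deviates from baseline."""
        if len(self.history) >= 10:          # wait for a baseline to form
            baseline = sum(self.history) / len(self.history)
            if count > baseline * (1 + self.deviation):
                print(f"ALERT: volume spike {count} vs baseline {baseline:.0f}")
            elif count < baseline * (1 - self.deviation):
                print(f"ALERT: volume drop {count} vs baseline {baseline:.0f}")
        self.history.append(count)


monitor = VolumeMonitor()
for per_minute_count in [100, 105, 98, 102, 99, 101, 97, 103, 100, 104, 12]:
    monitor.record(per_minute_count)  # the final interval triggers a drop alert
```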

Poor error handling: No system can run error free. Every system should have a fail-safe design with a capable error-handling mechanism that catches failures and retries them. When building assured-delivery systems, not a single event or transaction can be ignored. Accurately pinpointing error sources and preempting them is absolutely necessary. Event retries should be part of the error-handling component to ensure data is delivered to the destination. The system should plan for short-term retries, long-term retries, and manual intervention when necessary.
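A minimal sketch of such a retry component, assuming exponential backoff for short-term retries and a hypothetical dead-letter store for long-term retries and manual review:

```python
# Sketch: short-term retries with exponential backoff, then hand the event to
# a dead-letter store for long-term retry or manual review. Names are illustrative.
import time


def park_for_later(event):
    """Stand-in for writing to a dead-letter topic or retry table."""
    print(f"Parked for long-term retry / manual review: {event.get('id')}")


def deliver_with_retries(event, send, max_attempts=3, base_delay=1.0):
    """Attempt delivery; never silently drop an event."""
    for attempt in range(1, max_attempts + 1):
        try:
            send(event)
            return True
        except Exception as err:
            print(f"Attempt {attempt} failed: {err}")
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    park_for_later(event)
    return False
```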

Ignoring scalability: Poorly scalable integration systems negatively impact metrics like data volume, sync time, and data accuracy. Such poor performance affects business continuity. Integration hub components should have capacity planning to handle varying data volumes and resiliency for disaster recovery. Replicas running in different availability zones or data centers should be considered to handle infrastructure failures.

Missing self-serve provision: A centralized integration hub can empower businesses to establish a unified view of their data, breaking down data silos and promoting cross-functional data sharing and collaboration. Cross-functional teams should own their integrations and be able to connect systems without a steep learning curve. The integration platform should be developed in a way that allows this.

Delegating data security and governance responsibilities to spokes: Although the integration hub keeps data only in transit, it is still paramount to keep data in motion secure and governed to meet compliance requirements. Features like access control, standard encryption, confidential data handling, and data masking should be well supported.
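As a small illustration of confidential data handling in transit, the sketch below masks assumed sensitive fields before a record leaves the hub; the field names and masking rule are only examples.

```python
# Sketch: mask assumed confidential attributes while data moves through the hub.
# The sensitive field names and masking rule are only examples.
SENSITIVE_FIELDS = {"ssn", "card_number"}


def mask(record):
    """Return a copy of the record with sensitive values partially hidden."""
    masked = dict(record)
    for field in SENSITIVE_FIELDS & masked.keys():
        value = str(masked[field])
        masked[field] = "*" * max(len(value) - 4, 0) + value[-4:]
    return masked


print(mask({"customer_id": "C-1", "card_number": "4111111111111111"}))
# {'customer_id': 'C-1', 'card_number': '************1111'}
```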


The above list covers only the most pressing anti-patterns. I'm sure there are many other ways to gain clarity on what "not to do" while developing an integration hub. With ever-changing market dynamics and technology landscapes, technologists will keep uncovering them to constantly improve integration metrics.

Hope you find this article useful. 

Wednesday, January 25, 2023

Processing Partial Valid Data in Real-time Application Integration

Introduction 

Siloed applications across the enterprise need to be interconnected to leverage data produced in other trusted data source applications. This helps different business units collaborate better, reuse data, analyze it, make informed decisions, and ultimately add great value to their offerings. The process of synchronizing data across applications is a crucial exercise in enterprises, and teams spend significant effort building robust integrations. There are many types of integration mechanisms followed in the industry today to build robust systems that interact with one another. Still, there are challenges that need to be addressed.

Data is tagged as invalid for various reasons. It could happen at the source or in transit; incorrect processing and network issues can make data invalid. Such invalid data causes systems that depend on real-time data to fail to deliver data to customers on time. The turnaround time to fetch the valid data in further attempts delays report delivery and analytics, which in turn slows down informed decision-making. I'm proposing a solution in this article to mitigate this issue.


Widely Used Approach


Let us first understand the commonly used sync mechanism.
The main components that are part of any data synchronization are:
Data: Data is a strategic asset to enterprises and comes in many forms. With structured and unstructured at the top of the classification hierarchy, there are various other formats under this classification.
Source Application: A data source that is trusted across the enterprise.
Middleware: Links two or more separate applications. Provides the common connection, orchestration, data transformation, and mapping logic for integrating heterogeneous applications.
Target Application: An application that receives data from external sources and stores it.

The common approach in an automated real-time integration system suffers from time delays when it encounters invalid data:
Structured data is always subjected to validation testing for each of its attributes when it is migrated to a new application. Receiving applications are designed and developed to reject the whole record even if a single attribute in that record is invalid. The rejected record comes back to the source application, where it has to be inspected and corrected either in an automated way or manually. Manual intervention to correct the data consumes more time and is error prone. This delay can make many reporting or analytics applications that do not even depend on the invalid attributes lose time unnecessarily. A minimal sketch of this whole-record rejection appears after the steps below.

Step-by-step process:
1. Initial data load: Middleware receives the data and finds a certain attribute empty
2. Handling invalid data: Middleware tries to find the data in other source systems if orchestration is enabled
3. Middleware processing: Middleware detects the record as invalid
4. Middleware rejects the data back to the source application
5. Data correction at source: The source application corrects the data and pushes it to the middleware. This can take a long time depending on the type of error; most of the time, manual corrections take days
6. Reconciled data in target system: Once corrected, data is pushed again to the middleware
7. Middleware validates the data and, if valid, pushes it to the target application
8. The target application validates the data and, if valid, stores it
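A minimal sketch of this whole-record rejection behavior, with illustrative field names and hypothetical callbacks for pushing to the target and returning to the source:

```python
# Sketch of the common behavior: reject the entire record if any attribute
# fails validation. Field names and the callbacks are illustrative.
REQUIRED_ATTRIBUTES = ["customer_id", "email", "country"]


def validate_record(record):
    """Return the list of required attributes that are missing or empty."""
    return [attr for attr in REQUIRED_ATTRIBUTES if not record.get(attr)]


def process(record, push_to_target, return_to_source):
    errors = validate_record(record)
    if errors:
        # The whole record goes back to the source, even if one field is bad.
        return_to_source(record, errors)
    else:
        push_to_target(record)
```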

Commonly used integration mechanism


Issues in Current Approaches


No integration is error free. Errors can happen due to various issues such as computational factors, network issues, and heterogeneity between applications. These kinds of errors can be minimized with careful design but cannot be eliminated, as external factors play a major role. Most integration systems build a feature to handle invalid data, as the absence of such a feature will lead to data mismatches, inaccurate computation, false reporting, mistrust, and huge cost. However, the commonly followed standard approaches are not enough and still struggle with significant time delays while handling invalid data.

There are many challenges in achieving data sync across applications. Data is marked as invalid by receiving applications when it is incomplete, inconsistent, inaccurate, or in the wrong format.

Not every attribute in the rejected record may be required for every kind of reporting or analytics application. Many types of reporting and computation can still run without certain attributes. The practice of rejecting the whole record is unnecessary and adds latency to real-time data synchronization systems, which in turn negatively impacts data-dependent business in various ways.

My Solution Proposal 

If you agree with the issues stated above, read further to understand my proposal. My solution involves design changes in the middleware and target applications to accept a record even when some of its attributes are found invalid. The record with the error is stored in its present state, and meanwhile the error is notified to the source application with an appropriate error message. The missing or wrongly formatted data is imputed by the middleware and sent to the target application so that the record can be stored. Valid attributes of the same record can still be used in further data processing, computation, or display to users. Only the erroneous attribute is flagged; the flag indicates that the particular attribute cannot be used until it is corrected by the source application. Once the new attribute value coming from the source application is validated, the flag is removed for that attribute. Allowing the partial dataset to reside and be processed this way keeps the valid data available for computation without any unwanted delay.



High level component view

Invalid data flagging design

The flagging design has to be implemented to hold the invalid data with indicators on it (a sketch follows this list):
1. Flag the invalid record to indicate which attribute failed to pass validation rules
2. The format would be the name of the attribute; in the case of multiple attributes, all of those attribute names can be listed as comma-separated values
3. Flag the record to indicate which reports or processes can still run with such an invalid record
4. The flag should be used by reporting or processing logic to check whether a required attribute is flagged as incorrect
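One possible representation of this flag, assuming a flat record with illustrative field names and an added comma-separated flag attribute:

```python
# One possible representation of the flag described above.
# The record shape and field names are illustrative assumptions.
flagged_record = {
    "customer_id": "C-1001",
    "email": "not-an-email",                 # failed format validation
    "country": "US",
    "_invalid_attributes": "email",          # comma-separated flag (item 2)
    "_usable_for": ["volume_report", "regional_rollup"],  # item 3
}


def is_attribute_usable(record, attribute):
    """Reporting or processing logic checks the flag before using an attribute."""
    flagged = record.get("_invalid_attributes", "")
    return attribute not in [a for a in flagged.split(",") if a]


print(is_attribute_usable(flagged_record, "email"))    # False
print(is_attribute_usable(flagged_record, "country"))  # True
```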

Design implementation

Scenario 1: With an empty value (a middleware-side sketch follows these steps)
1. Initial data load: Middleware receives the data and finds an attribute empty
2. Handling invalid data: Middleware tries to find the data from other source applications if orchestration is enabled
3. Middleware processing: If a value for the attribute is not found, the middleware imputes the data by looking at historical data or predicting it using AI
4. Middleware processing: Middleware flags this record as invalid, marking the attribute
5. Middleware pushes the data to the target application
6. Target system processing: The target application accepts the data along with the flag
7. Record stored in target application: The target application stores the record along with the flag on the invalid attribute. This flagging helps the data-consuming component avoid using the attribute with an invalid value in computation or reporting.
8. Reconciliation by middleware: The middleware in parallel sends the invalid record back to the source application, asking for correction of the invalid attribute
9. Data correction at source: The source application corrects the data and pushes it to the middleware
10. Reconciled data in target application: The corrected data, without the flag, is pushed to the target application
11. Data correction at target application: The target application validates the data and removes the flag for the attribute
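A middleware-side sketch of Scenario 1; the helper callbacks (impute_from_history, push_to_target, notify_source) are hypothetical names introduced only for illustration.

```python
# Middleware-side sketch of Scenario 1. The helper callbacks
# (impute_from_history, push_to_target, notify_source) are hypothetical.
def handle_empty_attribute(record, attribute,
                           impute_from_history, push_to_target, notify_source):
    if not record.get(attribute):
        # Step 3: fill with a historical or predicted value so processing continues.
        record[attribute] = impute_from_history(record, attribute)
        # Step 4: flag the record, marking the imputed attribute as invalid.
        flags = record.get("_invalid_attributes", "")
        record["_invalid_attributes"] = f"{flags},{attribute}" if flags else attribute
        # Step 8: in parallel, ask the source application for a correction.
        notify_source(record, attribute, reason="empty value imputed")
    # Step 5: forward the (possibly flagged) record to the target application.
    push_to_target(record)
```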


Scenario 2: With a wrong format or inaccurate value (a target-side sketch follows these steps)
1. Initial data load: Middleware receives the data and finds an attribute with the wrong data format
2. Handling invalid data: Middleware flags this record as invalid, marking the attribute
3. Middleware processing: Middleware pushes the data to the target application
4. Target application processing: The target application accepts the data along with the flag
5. Record stored in target application: The target application stores the record along with the flag on the invalid attribute. This flagging helps the data-consuming component avoid using the attribute with an invalid value in computation or reporting.
6. Reconciliation by middleware: The middleware sends the invalid record back to the source application, asking for correction of the invalid attribute. In parallel, it also tries to get the data from other data sources.
7. Data correction at source: The source application corrects the data and pushes it to the middleware
8. Reconciled data in target application: The corrected data, without the flag, is pushed to the target application
9. Data correction at target application: The target application validates the data and removes the flag for the attribute. This allows the reporting and analytics functionality to utilize the attribute for further processing.
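A target-side sketch of how a correction could clear the flag, assuming the same comma-separated flag attribute as above and a hypothetical validate callback:

```python
# Target-side sketch: store flagged records as-is, and clear the flag once a
# corrected value passes validation. The validate callback is hypothetical.
def apply_correction(stored_record, attribute, new_value, validate):
    """Apply a corrected attribute value and unflag it if it is now valid."""
    if not validate(attribute, new_value):
        return False  # still invalid; keep the flag in place
    stored_record[attribute] = new_value
    remaining = [a for a in stored_record.get("_invalid_attributes", "").split(",")
                 if a and a != attribute]
    stored_record["_invalid_attributes"] = ",".join(remaining)
    return True  # the attribute is now available to reporting and analytics
```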

Process Flow


Data Format


Conclusion

The following types of applications can benefit from the proposed design:
1. Real-time integration of applications
2. Integration of transactional systems where each data event needs to be captured and propagated to other systems without delay
3. Structured data migration to other applications
4. Real-time reporting and analytics applications that depend on external applications for data
5. Applications interacting in the Hub and Spoke design pattern