Wednesday, May 29, 2019

Error Handling in Application Integration



Application integration does not always follow the happy path, irrespective of the domain or the mechanism adopted. More and more enterprises are adopting microservices, which calls for integrating various applications in real time. This integration does not always result in successful data flow, for reasons that include business, application, and network errors or limitations. Handling such failures in real time is critical to the functioning of the systems and to fulfilling the assured-delivery requirement. Understanding the types of errors and failures, and how to process, retry, and transform failed messages, gives a higher rate of success.

The areas or activities that need to be looked into to solve this problem are listed below:
  • Identifying the errors
  • Defining error categories
  • Formulating recoverable and non-recoverable errors
  • Defining workflow steps – automated or manual
  • Retry mechanism for recoverable errors
  • Persistence logic for long-term recovery
  • Message reconstruction process
  • Defining manual intervention process

Use case: REST API integration

Error Identification and Categorization


400 category: User input errors (system related)
• 401 – Authentication issue: retry after obtaining a fresh token. If it still fails, send an alert.
• 404 – Not found: retry logic required.
• 403 – Authorization issue: maybe a one-time retry, then an alert.

500 category – Internal server errors (business validation related)

• For these messages, it is important to figure out all the error codes returned by the target systems.
• It is important to find out whether the error codes are common or specific to each type of data-validation failure in each spoke; the codes are the only starting point we have.
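
To make the handling consistent, this categorization can be captured in a small lookup that the integration layer consults before deciding what to do with a failed call. Below is a minimal sketch in Python; the function and action names are illustrative, not part of any particular product:

```python
from enum import Enum

class Action(Enum):
    RETRY_WITH_FRESH_TOKEN = "retry_with_fresh_token"   # 401: re-authenticate, then retry
    RETRY = "retry"                                      # transient, safe to retry
    RETRY_ONCE_THEN_ALERT = "retry_once_then_alert"      # 403: one attempt, then alert
    INSPECT_BUSINESS_ERROR = "inspect_business_error"    # 5xx: needs spoke-specific mapping

def categorize(status_code: int) -> Action:
    """Map an HTTP status code from a target system to a handling action."""
    if status_code == 401:
        return Action.RETRY_WITH_FRESH_TOKEN
    if status_code == 403:
        return Action.RETRY_ONCE_THEN_ALERT
    if status_code == 404:
        return Action.RETRY
    # Everything in the 5xx range, and anything unknown, goes for inspection
    # against the spoke-specific business error codes.
    return Action.INSPECT_BUSINESS_ERROR
```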


Recoverable and non-recoverable errors

Errors such as temporary network failures or application maintenance downtime can be recoverable. A data-validation error requires transformation, either automatic or manual, depending on the complexity of the business rules. The retry logic should be designed in such a way that it handles these cases individually.
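
One way to encode this distinction is a simple classification step that runs before the retry logic; the error names below are illustrative and would be replaced by whatever the spokes actually report:

```python
# Illustrative taxonomy; the real lists come from the error codes of each spoke.
RECOVERABLE = {"network_timeout", "connection_refused", "maintenance_window", "rate_limited"}
NON_RECOVERABLE = {"schema_mismatch", "mandatory_field_missing", "business_rule_violation"}

def is_recoverable(error_type: str) -> bool:
    """Recoverable errors go to the retry pipeline; everything else needs
    transformation or manual correction before the message is resent."""
    if error_type in RECOVERABLE:
        return True
    # Unknown errors are treated as non-recoverable so that they surface for review.
    return False
```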

Retry mechanism for recoverable errors

The picture below illustrates the short-term and long-term retry logic.
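
To make the short-term part concrete, here is a minimal sketch: the call is retried a few times with an increasing delay, and the message is handed over to long-term (persisted) recovery when the attempts are exhausted. The `send` callable and the `persist_for_long_term_retry` hook are placeholders for the actual integration call and the persistence step described later.

```python
import time

def short_term_retry(send, message, max_attempts=3, base_delay_seconds=2):
    """Try the call a few times with exponential backoff before giving up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send(message)
        except Exception as error:  # in practice, catch only recoverable errors
            if attempt == max_attempts:
                persist_for_long_term_retry(message, error)  # hand off to long-term recovery
                raise
            time.sleep(base_delay_seconds * 2 ** (attempt - 1))  # 2s, 4s, 8s, ...

def persist_for_long_term_retry(message, error):
    """Placeholder: store the message and error for the long-term retry scheduler."""
    pass
```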



Circuit breaker design

A simple circuit breaker is used together with the short-term retry to avoid making the external call while the circuit is open, and the breaker itself should detect whether the underlying calls are working again. We can implement this self-resetting behavior by trying the remote call again after a suitable interval and resetting the breaker if it succeeds. This also protects us from unexpected failures of remote calls.
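
A minimal self-resetting breaker along those lines could look like the sketch below; the thresholds and the single half-open trial call are illustrative choices, and in practice this usually comes from a library (for example Akka's built-in circuit breaker) rather than hand-rolled code.

```python
import time

class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures; after `reset_timeout`
    seconds it lets one trial call through and closes again if that call succeeds."""

    def __init__(self, failure_threshold=5, reset_timeout=30):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, remote_call, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: skipping remote call")
            # Reset interval has passed: allow one trial call (half-open state).
        try:
            result = remote_call(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.time()  # open (or re-open) the circuit
            raise
        self.failure_count = 0   # success resets the breaker
        self.opened_at = None
        return result
```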


    



Persistence logic for long-term recovery

The error, along with the message, needs to be stored temporarily so that the message can be resent once the correction has been made to the integration flow. The data model should include:

Configurable fields for category definition:
1. Retry attempts – number
2. Frequency/duration – number (in minutes)
3. Alert required – varchar (to store the alert email address)
4. Manual flag – int (Manual, Auto, None)
5. Priority – int (High, Medium, Low)

These are the additional message fields:
1. Status – varchar or int to store Pending, Ongoing, Completed (Message table)
2. Unique ID (trace ID) – required to identify the error independently of the message object, so that each message can be treated differently; this has to be the primary key (Message table)
3. ID column – has to hold the opportunity/line-item (oppty/LI) ID
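
A rough sketch of that model as two tables is given below, created through sqlite3 purely for illustration; the column names and types are assumptions to be adapted to the actual store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real persistence store
conn.executescript("""
CREATE TABLE error_category (
    category        TEXT PRIMARY KEY,   -- e.g. 'auth_failure', 'validation_error'
    retry_attempts  INTEGER,            -- number of retries allowed
    frequency_mins  INTEGER,            -- gap between retries, in minutes
    alert_email     VARCHAR(255),       -- where to send alerts, if required
    manual_flag     INTEGER,            -- manual, auto, or none
    priority        INTEGER             -- high, medium, or low
);

CREATE TABLE failed_message (
    trace_id     VARCHAR(64) PRIMARY KEY,  -- unique ID so each message is treated individually
    business_id  VARCHAR(64),              -- opportunity / line-item (oppty/LI) ID
    category     TEXT REFERENCES error_category(category),
    status       VARCHAR(16),              -- Pending, Ongoing, Completed
    payload      TEXT,                     -- original message, kept for reconstruction
    last_error   TEXT                      -- last error returned by the target system
);
""")
```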

Defining manual intervention process

Some errors cannot be resolved automatically by program logic; this kind of error calls for manual intervention. For example, if the target system expects an alphanumeric value for a certain mandatory field and the source system sends a purely numeric value, the synchronization fails. Unless the value is changed to meet the target system's expectations, it is not possible to make the flow successful. User intervention is required most of the time to correct the data or the system. Every such case needs to be identified, its process well defined, and the message handled accordingly before it is resent.
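
One simple way to feed such cases into a manual workflow is to flag the stored message for review instead of retrying it; the sketch below reuses the hypothetical `failed_message` table from the persistence section.

```python
def route_for_manual_intervention(conn, trace_id, reason):
    """Mark a non-recoverable message for manual correction and stop automatic retries."""
    conn.execute(
        "UPDATE failed_message SET status = ?, last_error = ? WHERE trace_id = ?",
        ("Pending", reason, trace_id),
    )
    conn.commit()
    # An operator corrects the payload (e.g. fixes the alphanumeric field),
    # after which the message is reconstructed and resent through the normal flow.
```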

Conclusion

A few other areas that need to be included are a scheduler for triggering the long-term retry, the message reconstruction process, and the alert/notification mechanism. With these systems in place, it becomes easy and manageable to handle both expected and unexpected error scenarios.
Application integration is never complete without a robust error-handling process, and the benefit of having one is enormous. I am happy to assist further and to take your feedback, especially any improvements that you can think of.

Sunday, March 24, 2019

Enterprise Integration - Hub & Spoke design


Large enterprises run many applications, tools, and utilities across business domains (ERP, CRM, sales, HR, finance, etc.), which poses a big challenge for IT: supporting effective collaboration across business domains, data analysis, and critical data synchronization. With many such heterogeneous applications being added over time, integrating each of them point to point would increase the complexity exponentially. To reduce integration complexity, the hub-and-spoke design paradigm is regarded as one of the best solutions. The hub needs to be designed to be highly concurrent, distributed, scalable, microservices-oriented, cloud-hosted, and container-orchestrated, so that it can offer all the elements of data integration, such as real-time data streaming, transformation, synchronization, quality, and management, and ensure that information is timely, accurate, and consistent across heterogeneous applications.

Event-driven design:
Spoke applications send messages to the hub on their create or update events. Upon receiving such a message, and based on the routing rules, the hub decides which spokes the message should be forwarded to. Routing rules determine the message flow across the spoke systems, and these rules rely on trusted data sources for fetching data. Data validation, synchronization, transformation, enrichment, and filtering for every message are carried out by integrating the hub with those trusted data sources.
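
A routing rule can be as simple as a mapping from (source spoke, event type) to the set of destination spokes, looked up when a message arrives. The spoke and event names below are hypothetical:

```python
# (source spoke, event type) -> destination spokes; names are hypothetical
ROUTING_RULES = {
    ("crm", "opportunity.created"): ["erp", "finance"],
    ("crm", "opportunity.updated"): ["erp"],
    ("hr",  "employee.updated"):    ["finance"],
}

def route(message: dict) -> list:
    """Return the spokes this message should be forwarded to."""
    key = (message["source"], message["event_type"])
    return ROUTING_RULES.get(key, [])  # unmatched events are not forwarded
```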

Tools & technologies for building the hub on IBM public cloud:
·       The real-time data pipeline is built on a Kafka cluster, which helps the hub handle large amounts of data and provides message reliability (a small publishing sketch follows this list).

·       The Akka toolkit is used to simplify concurrent code development and to provide infrastructure that lets us scale without modifying the application. Akka seamlessly handles message distribution and communication at large scale. Its other big advantage is that it simplifies the concurrency logic, which in turn improves coding efficiency.
·       Play provides a lightweight, stateless web server framework, and we have chosen it to expose APIs with minimal resource consumption.
·       Redis is used for caching and error handling.
·       Grafana and DataDog are used to monitor the infrastructure and services and keep the administrator informed about system health.
·       Zipkin is used to trace the message flow across systems and provides real-time updates on message transitions.
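
As an example of the first item, a spoke event can be published onto the pipeline with a few lines of the kafka-python client; the broker address and topic name are placeholders:

```python
import json
from kafka import KafkaProducer  # kafka-python client, shown here for illustration

producer = KafkaProducer(
    bootstrap_servers="kafka-broker:9092",  # placeholder broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish a hypothetical spoke event onto the hub's ingest topic.
producer.send("hub-ingest", {"source": "crm", "event_type": "opportunity.created", "id": "123"})
producer.flush()
```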




Core components that the hub offers:
·       API service: Exposes integration over REST so that spokes can integrate with the hub. An interface contract is established here to standardize the communication from spoke to hub. It provides HTTPS endpoints and OAuth 2.0 to secure the communication.
·       Processor service: Validates, standardizes, enriches, filters, and transforms the incoming data, and applies the message routing rules to determine the destination spokes for incoming messages.
·       Persistence service: Provides data persistence to store every transaction that flows through the hub. DashDB is used to store the consolidated data.
·       Adapter service: Provides the service that consumes the spoke interfaces. This generalized service is derived by each individual spoke adapter so that it complies with that spoke's own interface (see the sketch after this list).
·       Dashboard service: Provides a consolidated report on the opportunity messages and helps support and administration staff with the message transaction details.
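
The adapter idea can be sketched as a small base class that each spoke-specific adapter derives from, overriding only the parts that are specific to its interface; the class and field names are illustrative.

```python
from abc import ABC, abstractmethod

class SpokeAdapter(ABC):
    """Generalized adapter; each spoke derives it to comply with its own interface."""

    def deliver(self, message: dict) -> None:
        payload = self.transform(message)  # spoke-specific representation
        self.send(payload)                 # spoke-specific transport call

    @abstractmethod
    def transform(self, message: dict) -> dict: ...

    @abstractmethod
    def send(self, payload: dict) -> None: ...

class ErpAdapter(SpokeAdapter):
    """Hypothetical ERP spoke adapter."""

    def transform(self, message: dict) -> dict:
        return {"OPPORTUNITY_ID": message["id"], "STATUS": message.get("status", "NEW")}

    def send(self, payload: dict) -> None:
        print("POST /erp/opportunities", payload)  # stand-in for the real REST call
```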

Real-time integration exception handling and assured delivery:
It is critical to handle exceptional conditions arising from unexpected events, invalid data, and process errors at runtime. The ecosystem must adopt a multi-pronged approach to handle the various types of errors, with assured delivery, error routing, error mapping, logging, notification, and dashboard monitors for error details. A robust retry framework is used to automatically resolve recoverable errors, and the hub forks the workflow between recoverable and non-recoverable exceptions to manage each of them appropriately.
A circuit breaker mechanism is adopted to handle remote calls efficiently. This gives us a lot of savings in terms of optimal resource usage and mitigates failure cascading across systems.