Wednesday, May 29, 2019

Error Handling in Application Integration


Application integration does not always follow the happy path, irrespective of the domain and the mechanism adopted. More and more enterprises are adopting microservices, which calls for real-time integration of various applications. This integration does not always result in a successful data flow, for reasons that include business, application, and network errors or limitations. Handling such failures in real time is critical to the functioning of the systems and to fulfilling the assured delivery requirement. Understanding the types of errors and failures, and then processing, retrying, and transforming the affected messages, leads to a higher success rate.

The areas or activities that need to be looked into to solve this problem are listed below:
  • Identifying the errors
  • Defining error categories
  • Separating recoverable from non-recoverable errors
  • Defining workflow steps - automated or manual
  • Retry mechanism for recoverable errors
  • Persistence logic for long term recovery
  • Message reconstruction process
  • Defining manual intervention process

Use case with REST API integration

Error Identification and Categorization


400 category – User input error (system related)
• 401 – Authentication issue: retry after getting a fresh token. If the retry still fails, send an alert.
• 404 – Retry logic required
• 403 – Possibly a one-time retry, then alert

500 category – Internal server error (business validation related)

• For messages, it is important to figure out all the error codes returned by the target systems.
• It is also important to find out whether the error codes are common or specific to each type of data validation failure in each spoke. The analysis has to start from the codes.
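
The status-code rules above could be captured as a small lookup table. The following is a minimal Python sketch, not a definitive implementation: the `ErrorPolicy` type, the `policy_for` helper, the retry count for 404, and the fallback policies for unlisted codes are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorPolicy:
    category: str        # "4xx" or "5xx"
    max_retries: int     # automatic retries before alerting
    refresh_token: bool  # fetch a fresh token before retrying
    alert: bool          # send an alert once retries are exhausted


# Hypothetical policy table built from the codes listed above.
POLICIES = {
    401: ErrorPolicy("4xx", max_retries=1, refresh_token=True, alert=True),
    403: ErrorPolicy("4xx", max_retries=1, refresh_token=False, alert=True),
    # The retry count for 404 is an assumed value.
    404: ErrorPolicy("4xx", max_retries=3, refresh_token=False, alert=True),
}

# Assumed fallbacks for codes that are not listed explicitly.
DEFAULTS = {
    4: ErrorPolicy("4xx", max_retries=0, refresh_token=False, alert=True),
    5: ErrorPolicy("5xx", max_retries=0, refresh_token=False, alert=True),
}


def policy_for(status_code: int) -> ErrorPolicy:
    """Return the handling policy for an HTTP error status code."""
    if status_code in POLICIES:
        return POLICIES[status_code]
    family = status_code // 100
    if family in DEFAULTS:
        return DEFAULTS[family]
    raise ValueError(f"not an error status: {status_code}")
```

Keeping the rules in data rather than in branching code makes it easier to add the target-specific codes once they have been collected from each spoke.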


Recoverable and non-recoverable errors

Errors such as temporary network failures and application maintenance downtime are recoverable. Data validation errors require transformation, either automatic or manual, depending on the complexity of the business rules. Retry logic should be designed in such a way that it handles these cases individually.

Retry mechanism for recoverable errors

The picture below illustrates the short-term and long-term retry logic.
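
One way to connect the two kinds of retry is sketched below in Python, under stated assumptions: a few quick attempts with exponential backoff form the short-term retry, and a message that still fails is handed to a queue for long-term retry. `RecoverableError`, `send_with_retry`, the attempt count, and the delays are hypothetical names and values, not taken from any particular library.

```python
import time
from typing import Callable, List


class RecoverableError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. network timeouts)."""


def send_with_retry(send: Callable[[dict], None], message: dict,
                    long_term_queue: List[dict], attempts: int = 3,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try `send` up to `attempts` times; park the message if all attempts fail."""
    for attempt in range(attempts):
        try:
            send(message)
            return True
        except RecoverableError:
            # Wait before the next attempt, doubling the delay each time.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    # Short-term retry exhausted: hand over to long-term retry.
    long_term_queue.append(message)
    return False
```

Only `RecoverableError` is caught here, so non-recoverable errors propagate to the caller immediately instead of being retried.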



Circuit breaker design

A simple circuit breaker is used together with the short-term retry to avoid making the external call while the circuit is open, and the breaker itself should detect when the underlying calls are working again. We can implement this self-resetting behavior by trying the remote call again after a suitable interval and resetting the breaker if it succeeds. This also guards against unexpected failures in remote calls.
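
The self-resetting behavior described above can be sketched in a few lines of Python. This is an illustrative sketch rather than a production breaker: the `CircuitBreaker` class, its threshold and timeout defaults, and the `CircuitOpenError` exception are assumptions introduced for the example.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; remote call skipped")
            # Interval elapsed: let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # A successful call resets the breaker.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping the remote call as `breaker.call(send, message)` inside the short-term retry means that, while the circuit is open, attempts fail fast without touching the remote system.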


    



Persistence logic for long term recovery

The error, along with the message, needs to be stored temporarily so that the message can be resent after the integration flow has been corrected. The data model should include:

Configurable fields for category definition:
1. Retry attempts – number
2. Frequency/duration – number, in minutes
3. Alert required – varchar, to store the email address
4. Manual flag – int, for Manual, Auto, None
5. Priority – int, for High, Medium, Low

Additional message fields:
1. Status – varchar or int, to store Pending, Completed, On Going (message table)
2. Unique ID (trace ID) – required to identify the error separately from the message object, so that each message can be treated differently. This has to be the primary key (message table).
3. ID column – has to be the oppty/LI id.
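
The fields listed above could map onto two tables, one for category configuration and one for stored messages. The sketch below uses Python's built-in `sqlite3` module purely for illustration. The table and column names, the integer codes for the manual flag and priority, and the `error_text` and `payload` columns are assumptions, not part of the original data model.

```python
import sqlite3

# Hypothetical schema; names and integer codes are assumed for illustration.
SCHEMA = """
CREATE TABLE error_category (
    category_id        INTEGER PRIMARY KEY,
    name               TEXT NOT NULL,
    retry_attempts     INTEGER NOT NULL,
    retry_interval_min INTEGER NOT NULL,  -- frequency/duration in minutes
    alert_email        TEXT,              -- where to send alerts, if any
    manual_flag        INTEGER NOT NULL,  -- 0 = None, 1 = Auto, 2 = Manual
    priority           INTEGER NOT NULL   -- 1 = High, 2 = Medium, 3 = Low
);
CREATE TABLE error_message (
    trace_id    TEXT PRIMARY KEY,         -- unique id per failed message
    record_id   TEXT NOT NULL,            -- oppty/LI id
    category_id INTEGER NOT NULL REFERENCES error_category(category_id),
    status      TEXT NOT NULL DEFAULT 'Pending'
                CHECK (status IN ('Pending', 'On Going', 'Completed')),
    error_text  TEXT NOT NULL,            -- the error returned by the target
    payload     TEXT NOT NULL             -- the original message body
);
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the error store and create the tables (in memory by default)."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Storing the payload next to the error is what allows the message to be resent later without going back to the source system.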

Defining manual intervention process

Some errors cannot be resolved automatically by program logic. This kind of error calls for manual intervention. For example, if the target system expects an alphanumeric value for a certain mandatory field and the source system sends a numeric value, the synchronization fails. Unless the value is changed to meet the target system's requirement, the flow cannot succeed. Most of the time, user intervention is required to correct the data or the system. Every such case needs to be identified, the process well defined, and the messages handled accordingly before they are resent.
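
The alphanumeric example above can be turned into a small routing sketch in Python. Everything here is hypothetical: the validation rule in `meets_target_rule`, the `route` and `resend_corrected` helpers, and the assumption that each message carries a `trace_id` key.

```python
from typing import Callable, Dict, List


def meets_target_rule(value: str) -> bool:
    """Hypothetical target rule: letters and digits, with at least one letter."""
    return value.isalnum() and not value.isdigit()


def route(message: dict, field: str, manual_queue: List[dict],
          send: Callable[[dict], None]) -> str:
    """Send valid messages; park invalid ones for a user to correct."""
    if meets_target_rule(str(message.get(field, ""))):
        send(message)
        return "sent"
    manual_queue.append(message)
    return "manual"


def resend_corrected(manual_queue: List[dict], field: str,
                     corrections: Dict[str, str],
                     send: Callable[[dict], None]) -> int:
    """Apply user-supplied corrections (keyed by trace_id) and resend."""
    resent = 0
    for message in list(manual_queue):  # iterate over a copy while removing
        fixed = corrections.get(message["trace_id"])
        if fixed is not None and meets_target_rule(fixed):
            message[field] = fixed
            send(message)
            manual_queue.remove(message)
            resent += 1
    return resent
```

Messages without a valid correction stay in the manual queue, so nothing is resent until a user has actually fixed the data.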

Conclusion

A few other areas that need to be included are a scheduler for triggering the long-term retry, the message reconstruction process, and an alerting/notification mechanism. With these systems in place, it becomes easy and manageable to handle both expected and unexpected error scenarios.
Application integration is never complete without a robust error handling process, and its benefit is enormous. Happy to assist further and to hear your feedback, especially any improvements you can think of.