Sunday, April 17, 2022

 India's Personal Data Protection Bill 2021 - Chapter-wise Summary for Techies



The Government of India (GoI) is in the process of framing comprehensive legislation to protect the personal data of its citizens. The Joint Parliamentary Committee (JPC) was formed in 2019 to study and shape the Personal Data Protection (PDP) Bill for India. After two years, the committee tabled its report in the Indian Parliament with recommendations to protect the personal data of Indian citizens.


Many countries have already framed laws to safeguard the privacy of their citizens. Although the PDP Bill is similar to those of other countries, especially the GDPR (the General Data Protection Regulation of the European Union), it is critical to understand the nitty-gritty of the bill to remain compliant while doing business in India. Global IT companies that already comply with various countries' privacy laws can extend their implementations with minimal effort to comply with the PDP Bill once it is enacted, since it provides a transition period.


My intention in this blog is to highlight the key recommendations of the JPC, going through each chapter. I'm keeping it concise to help developers, designers, architects, and product owners get a quick summary of this bill. For more details, one can refer to the appropriate section in the PDP Bill report linked in the references section below. I'm citing the section numbers alongside the clauses to help readers quickly find them in the original JPC report. There are fourteen chapters in this bill explaining public policy on data protection, and I'm only listing the key aspects that are important for IT professionals.



Chapter 1: Preliminary

This chapter contains the official definitions, terms, and scope that subsequent chapters reference. It is critical to go through this chapter carefully to understand the definitions. Important keywords are:

Personal data, non-personal data, sensitive data classification, data fiduciary (processor), authorities, data profiling, etc. One key highlight is that the provisions of this bill apply to both personal and non-personal data. Processing of personal data includes collecting, storing, disclosing, and sharing within the territory of India, and also applies to entities not present in the territory of India but carrying out business in India.


Section 15 defines a person as per this report. Having clarity on who comes under the definition of Person is a must. As per the report, a Person can be an individual, a Hindu undivided family, a company, a firm, an association of persons or a body of individuals, the State, and every artificial juridical person.

 

Section 41 in this chapter lists what constitutes sensitive personal data, and it is important to keep this in mind while designing applications. The list includes: financial data, health data, sex life, sexual orientation, biometric data, genetic data, transgender status, intersex status, caste/tribe, and religious belief.
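As a small illustration of how a team might use this list (this is not from the bill; the field names and mapping below are purely illustrative assumptions), schema fields can be tagged with a sensitivity category so that stricter handling such as encryption or explicit consent checks can be enforced:

# Minimal sketch (not from the bill): tagging fields that hold sensitive personal data.
# The field names and mapping below are illustrative assumptions.
SENSITIVE_CATEGORIES = {
    "financial", "health", "sex_life", "sexual_orientation", "biometric",
    "genetic", "transgender_status", "intersex_status", "caste_or_tribe",
    "religious_belief",
}

FIELD_CATEGORY = {           # hypothetical schema metadata
    "salary_account_no": "financial",
    "blood_group": "health",
    "full_name": None,       # personal, but not sensitive
}

def is_sensitive(field_name: str) -> bool:
    """Return True if the field is classified as sensitive personal data."""
    return FIELD_CATEGORY.get(field_name) in SENSITIVE_CATEGORIES

print(is_sensitive("salary_account_no"))  # True
print(is_sensitive("full_name"))          # False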


Note the various actors and their roles in this chapter. It has definitions for Data principal, Adjudicating officer, Consent Manager, Data Auditor, Data Fiduciary, Data Protection Officer, and Data Protection Authority of India.



Chapter 2: Obligations of the Data Processor


The sections in this chapter state how a processor or fiduciary must obtain consent from the data principal before collecting data. They mandate disclosure of the purpose, extent, nature, categories, and storage period of the data being collected.

The highlight of this chapter is that it enables a data processor to share and transfer personal data as part of business transactions, subject to the clauses below (a minimal consent-record sketch follows the list):

  • Disclose with whom the data will be shared. 
  • Provide contact details of data processor and data protection officer 
  • Right of data principal(person) to withdraw consent 
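
The bill does not prescribe a data structure, but a consent record along the lines below could capture these disclosures; all field names are illustrative assumptions.

from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

# Illustrative consent record (field names are assumptions, not mandated by the bill).
@dataclass
class ConsentRecord:
    principal_id: str                 # the person whose data is collected
    purpose: str                      # why the data is collected
    categories: list                  # categories of personal data covered
    shared_with: list                 # parties the data will be disclosed to
    retention_days: int               # storage period disclosed to the principal
    dpo_contact: str                  # data protection officer contact details
    given_at: datetime = field(default_factory=datetime.utcnow)
    withdrawn_at: Optional[datetime] = None   # set when the principal withdraws consent

    def is_active(self) -> bool:
        return self.withdrawn_at is None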



Chapter 3: Grounds for processing of personal data without consent


The State has allowed itself to collect and process personal data without consent for provisioning services, security, court orders, and treatment during medical emergencies. This is critical information for e-governance application development teams to optimize their data privacy design.


One key highlight here is that it allows a data processor to store personal data without consent in the context of employment, provided the data is not sensitive. HR applications, which usually need to store employee data, could continue to do so without explicit employee consent.

The other "reasonable purposes" that this chapter exempts from consent are: prevention or detection of fraud, security, credit scoring, mergers and acquisitions, operation of search engines, and processing of publicly available personal data.



Chapter 4: Personal data of children


With child rights protection as its objective, this chapter mandates a policy for parental/guardian consent. Profiling, tracking, behavioural monitoring, targeted advertising, or any other processing that could harm a child by violating informational privacy is disallowed. Registration with the Data Protection Authority is a must for data fiduciaries collecting children's data.



Chapter 5: Rights of Data Principal


This chapter covers the rights of the data principal over their data and mandates that the processor provide information in a clear and concise manner.

It is important to understand how a data principal can exercise these rights. The data principal can:

  • Ask for the identity of the data processor and the categories of personal data
  • Exercise the right to be forgotten
  • Nominate a legal heir
  • Request changes to agreement terms
  • Exercise the right to correction and erasure
  • Restrict or discontinue disclosure when the purpose is no longer served (20(1))

On the other hand, the act allows the data processor to provide justification when a request cannot be honoured or is not technically feasible (19(2)(b)). It also allows the data processor to charge the data principal a fee for providing the requested information (21(2)).



Chapter 6: Transparency and accountability measures


This is an interesting chapter for the IT fraternity, as it contains more implementation-level detail. It mandates that the processor prepare and publish a "privacy by design" policy covering:

  •  Business and technical systems design and process
  •  Obligations
  •  Approaches to transparency in data processing   

This chapter also recommends that the processor have a defined strategy for the following (a minimal de-identification sketch follows the list):

  • Encryption and de-identification process
  • Protect integrity of personal data
  • Prevent misuse
  • Notification and alert mechanism for data breaches; a breach must be reported within 72 hours of the processor becoming aware of it
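
The bill does not prescribe any particular de-identification technique; salted (keyed) hashing is one common approach, and the sketch below is illustrative only.

import hashlib
import hmac

# Illustrative de-identification sketch (the bill does not mandate a specific technique).
# A keyed hash replaces a direct identifier with a stable pseudonym.
SECRET_KEY = b"replace-with-a-vaulted-secret"   # assumption: key kept in a secret store

def pseudonymize(identifier: str) -> str:
    """Return a stable, irreversible pseudonym for a direct identifier."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

print(pseudonymize("ABCDE1234F"))   # an identifier becomes an opaque token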

This chapter also mandates the appointment of a Data Protection Officer and lists the responsibilities of that role. The bill expects continuously updated, detailed documentation of the privacy by design policy to be published on the processor's website. The documentation should contain:

  • Categories 
  • Purpose
  • Exceptional situations
  • Procedure for exercise of rights by Principal with contact detail and escalation process
  • Info on cross border transfers

It also calls for a data protection impact assessment (27(1)), which should contain:

  • Detailed description of proposed processing operation
  • Assessment of the potential harm that may be caused to the data principal

As per the bill, this assessment is validated by a Data Auditor, who assigns a rating in the form of a data trust score.



Chapter 7: Restriction on transfer of personal Data outside India


Sensitive data may be transferred outside India, but a copy of such data must continue to be stored in India (33(1)). This has a huge impact on the IT side of the business: a data centre inside India must be set up to store a copy of the data before it is transferred outside the country.


Another highlight is that central government approval is required for sharing sensitive personal data with a foreign government or agency (34(1.3)).



Chapter 8: Exemptions


This chapter lists the exemptions from the act that apply when the Authority is satisfied that the processing is for research, archiving, or statistical purposes (38). Allowing a sandbox environment for data processing in research and innovation is a highlight of this chapter.


To help startups, exemptions are provided under conditions such as the small entity having a low turnover, the processing being carried out for a very brief period (for example, just one day in a given year), and the work involving innovative solutions in AI, ML, or other emerging technologies. The sandbox exemption for innovation would immensely help research-oriented organizations.


The remaining chapters of the bill mainly provide details on the regulation and enforcement framework.


Chapter 9: Data protection authority of India


This chapter mainly covers the GoI's intention to set up the Data Protection Authority and provides details on the structure and duties of that authority. The framework setup information in this chapter is mainly relevant to public service authorities rather than IT companies.



Chapter 10: Penalties and compensation


This is an important chapter for businesses to understand the seriousness of this bill. It lists the different types of penalties and fines for non-compliance with the law.



Chapter 11: Appellate tribunal


This chapter instructs the Government of India to establish a tribunal to hear cases and conflicts arising out of data protection issues.



Chapter 12: Finance, Account and Audit


This chapter covers the government's fund allocation for the Data Protection Authority and provides detailed instructions to public policy implementers within the government.



Chapter 13: Offences


This chapter discusses the different types of offences under the data protection law, with punishments that include imprisonment and fines. It is of paramount importance for the legal departments of data processors to understand the context and spread awareness among responsible executives.



Chapter 14: Miscellaneous


The last chapter covers miscellaneous activities of authority and procedures to be followed in various scenarios around enactment of the data protection policy.



My View


This act is absolutely essential for protecting individual data privacy and supporting digital economy growth. With the growth of digital products and services in the country, data protection has taken centre stage. I strongly believe that a well-implemented data protection act would enforce citizens' fundamental right to privacy and build user trust and confidence in digital business carried out in India. The bill has good intentions and objectives. It addresses the most basic features such as simple consent forms, data minimization, data correction, data porting, breach notifications, restrictions on automated decisions with personal data, and, most importantly, citizen awareness.


Some of the clauses in this bill have been opposed and the committee is reviewing them. I'm hopeful that this law, once enacted, will reduce the misuse of personal data, ensure compliance, and promote data privacy awareness in India.




References:


JPC Report:

http://164.100.47.193/lsscommittee/Joint%20Committee%20on%20the%20Personal%20Data%20Protection%20Bill,%202019/17_Joint_Committee_on_the_Personal_Data_Protection_Bill_2019_1.pdf


Monday, February 28, 2022

Metrics for Data Sharing Platform

Enterprise architecture strategy mandates that systems measure quantifiable metrics. This mandate is challenging for certain types of applications and the environments they operate in; systems that work in the backend with no user interface, or CIO systems that cater only to the internal workforce, are a few examples. KPIs related to revenue and UX performance are the most popular ones, but there are many other SMART KPIs that help teams figure out the health of a system. Here, SMART is an acronym for Specific, Measurable, Attainable, Relevant, and Timely. As an example, I'm discussing a Data Sharing Platform in this blog to provoke readers' thoughts.

Data Sharing System


Data in an enterprise is generated and maintained by different business units. These multiple sources of trusted data need to be shared across the enterprise to help other departments run the business, and this forms the requirement for a data sharing system. The whole objective of such a system is to help data consumers gain accessible, analyzable, and actionable business data so they can build contextual information with minimal effort.

The main areas that a data sharing platform focuses on are:
Data Catalog: Helps maintain an organised data structure using metadata so consumers can discover and explore data.
Data Ingestion: Covers the extraction, transformation, and loading of data from various data sources.
Data Governance: Ensures the data is clean, meets enterprise standards, and is protected. This improves the integrity and reliability of information assets and metadata.
Data Accessibility: Helps users acquire data through industry-standard access mechanisms and data formats in a self-service way. This reduces any steep learning curve and in turn saves users time.
Security and Data Privacy: Protecting sensitive data is the most critical aspect of any data platform. Encryption at rest and in motion, access restriction, confidential/crown-jewel data masking, etc. are critical features.

There is always a challenge in obtaining quantifiable metrics for the above areas, but with a little effort it is possible to define the relevant KPIs and measure them (a small KPI sketch follows the list).

  • Data Volume
    • Data sources: Number of data sources from which data flows into the platform. This proves the system's capability to connect to various data sources, especially when the sources are heterogeneous.
    • Incoming flow rate: Data can flow in either real time or batch mode. The rate at which data flows into the platform needs to be measured, e.g. 2 million records per hour.
    • Outgoing flow rate: Similar to the incoming flow rate, the outgoing flow needs to be collected as well. The various users or external systems to which data flows should be logged and collected as metrics.
    • Total volume: The total size of the data in the system on any given day.
    • Real-time and batch jobs: Number of jobs that pull and push data in real-time or batch mode.

  • Data Protection
    • Data restriction: Restricted data (for example, US federal or DACH data) should not be accessible to everyone. The number of such restrictions should be captured as a metric.
    • Sensitive data protection: Wherever data needs to be encrypted or masked irreversibly, that must be measured as well. How many such fields are masked, and in which entities, is a good metric for quick understanding.
    • Number of user/group roles: The types of users/groups accessing the system and each of their access roles must be recorded and monitored.

  • Accessibility
    • Time to onboard new data consumers: The time taken to onboard users demonstrates the usability of the data sharing platform. Discovery, access, and data exploration are crucial for users and for building trust among collaborators.
    • Metadata views: The catalog provides the preface to the data. A good data catalog solves various issues and saves a lot of time.
    • Total number of consumers: This metric not only shows popularity but is also important for capacity planning of the system.

  • Data Quality
    • Support tickets: The number of bugs/defects reported should be continuously measured. This metric gives a quick read on the overall hygiene of the system.
    • User queries: The queries/help calls/chat sessions should be measured to monitor ease of use.

  • Build and Maintenance Effort
    • Total Team Size
    • Updates: The number of updates/upgrades to the system in a given period shows the team effort and helps in sizing and revenue calculation.
    • Cleanup: The cleanup activities taken up by administrators are another indicator of data quality.

  • Infrastructure
    • Uptime: The system uptime is critical and directly reflects business continuity.
    • Cost: Infrastructure cost is easy to collect and is required to derive various other business metrics.
    • Licenses: Number of third-party licenses acquired to build the system, and the expiry date for each of them.
    • Number of hardware resources: Number of hardware resources allotted.
    • Number of software/SaaS: Number of software packages installed or SaaS services provisioned.
    • Data backup: Backup frequency, storage format, and users having access.
    • Disaster Recovery: RTO (Recovery Time Objective) and RPO (Recovery Point Objective) are key metrics in determining database backup and disaster recovery requirements. 

  • Business KPIs
    • ROI: This is the ultimate metric for every software system, and collecting it requires deep financial understanding. Infrastructure cost, resource cost, operational cost, etc. need to be collected to derive the ROI.
    • NPS: User surveys, questionnaires, campaigns, etc. provide the information for this other key metric.
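
As a minimal sketch of how a few of these KPIs could be captured programmatically (all names and numbers below are illustrative assumptions, not from any specific platform):

from dataclasses import dataclass

# Illustrative KPI snapshot for a data sharing platform; field values are assumptions.
@dataclass
class KpiSnapshot:
    records_ingested: int      # records received in the measurement window
    window_hours: float        # length of the measurement window
    uptime_seconds: float      # seconds the platform was available in the window
    total_seconds: float       # total seconds in the window
    monthly_benefit: float     # estimated business value delivered per month
    monthly_cost: float        # infra + resource + operational cost per month

    def incoming_flow_rate(self) -> float:
        """Records per hour, e.g. '2 million records per hour'."""
        return self.records_ingested / self.window_hours

    def uptime_pct(self) -> float:
        return 100.0 * self.uptime_seconds / self.total_seconds

    def roi_pct(self) -> float:
        return 100.0 * (self.monthly_benefit - self.monthly_cost) / self.monthly_cost

snap = KpiSnapshot(48_000_000, 24, 86_100, 86_400, 120_000, 80_000)
print(round(snap.incoming_flow_rate()), round(snap.uptime_pct(), 2), round(snap.roi_pct(), 1))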




 
 





Wednesday, February 16, 2022

Future of Enterprise Application Integration(EAI) Middleware


There are various types of middleware in the software industry. This blog limits its scope to the EAI middleware that is widely used in enterprises. This type of middleware adds integration capability to a functional system that end users interact with. It offers connectivity to heterogeneous systems, data transformation, field mapping, routing logic, and sometimes a rule engine. Technical teams also prefer middleware to offload certain heavy operations from the main functional systems.

Typical functionalities of EAI middleware

Tuesday, November 16, 2021

EAI Tool Building With IBM Public Cloud Services

There are plenty of EAI middleware tools in the market, so why build a new one? The answer to this question is always debatable, with various reasons cited: complete control over the tool, extensibility, customizability for your own unique use cases, and of course low pricing. The low pricing argument is all the more tempting with cheap cloud infrastructure, varying loads, and fast-changing business dynamics supporting the justification.

If you are convinced that low price is what you are after, don't forget to check the important features that any EAI tool would provide and compare them with your requirements. Every popular tool in the market offers basic features such as minimal coding, quick mapping, real-time/batch mode, multiple connectors, retry mechanisms, error handling, monitoring, and reports. There are niche features, such as AI-based data mapping or integration, that some modern tools provide. If you are looking for the most basic features and you don't want to end up learning yet another tool, it's time to try IBM Public Cloud (IPC) services for developing your own EAI tool.

A microservices-based, containers-over-cloud architecture provides the mechanism for building such a tool quickly and efficiently, and there are multiple IBM public cloud offerings to support it. There is a Cloud Pak that comes with perfectly packaged services to quick-start the journey. On the other hand, there are individual services, from computing clusters to databases, that you can cherry-pick to build your own infrastructure package.

I want to share my experience of setting up my own services and how easy it was. To start with, I wanted all the basic features mentioned above in the tool. A few requirements I focused on are:

  • Integration options covering REST API, Kafka, and DB integration, so there is a clear need for multiple connectors
  • Data sync interval - real time is the requirement in my case
  • Exclusive data mapping, since heterogeneous applications need to be integrated
  • Assured delivery - can't miss even a single transaction

The above requirements clearly point to the kind of system that needs to be developed, which includes the following (a minimal API server sketch follows the list):

  • API server to listen to the calling application - a microservice with a lightweight server needs to be developed here
  • Independent connectors to connect to the API, Kafka, and DB of applications - utility backend microservices providing connection and retry logic
  • Data processor - yet another microservice that processes the data and prepares the mapping to a specific target system
  • A database that can persist data temporarily in case of connection issues
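
The post only states that the stack is Python with Flask, so the sketch below is a minimal, assumed shape of the API server microservice; the route name and payload fields are illustrative.

from flask import Flask, request, jsonify

app = Flask(__name__)

# Minimal sketch of the API server microservice (route and payload are assumptions).
@app.route("/v1/messages", methods=["POST"])
def receive_message():
    payload = request.get_json(force=True)
    if not payload or "target" not in payload:
        return jsonify({"error": "missing 'target' field"}), 400
    # In the real tool this would hand the message to the data processor service,
    # which maps/transforms it and forwards it to the right connector.
    return jsonify({"status": "accepted", "target": payload["target"]}), 202

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)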
Techno-Functional overview


Development Effort

Listing the effort required to set up the services that fulfil the above requirements.

IPC infra overview

Minimal IPC Services required:
  • IKS cluster with 3 nodes  and PV - To develop microservices
  • MongoDB - NoSQL DB for temporary persistence
  • LogDNA - Log analysis for debugging purpose

Developing Microservices

Setting up Red Hat OpenShift on IBM Cloud (ROKS) or IBM Kubernetes Service (IKS) to build cloud-native containerized microservices takes no more than 20 minutes. This includes all operational tasks, such as creating separate resources for test and production environments.

The technology used is Python with Flask. The connectors are the main services that take longer, as they require end-to-end testing and retry mechanisms to handle errors across the different integration approaches.

Microservices created:

  • Three different connectors for DB, Kafka and REST API
  • Auth server for OAuth2.0
  • API server to receive requests
  • Data processor that handles mapping, transforming and orchestration
CronJob scheduler is created to handle error retries at regular intervals. 

MongoDB Collection Setup and LogDNA Setup

MongoDB setup is quick: a simple collection is created to store failed records for later retries. LogDNA for log analysis is also a fairly easy setup to complete before you start coding and adding log statements.
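
A minimal sketch of that failed-record collection using pymongo; the connection string, database, and field names are assumptions for illustration.

from datetime import datetime, timezone
from pymongo import MongoClient

# Illustrative persistence of a failed record for later retry (names are assumptions).
client = MongoClient("mongodb://localhost:27017")   # replace with the IPC MongoDB URI
failed = client["middleware"]["failed_records"]

def store_failed(message: dict, error_code: int) -> None:
    failed.insert_one({
        "message": message,
        "error_code": error_code,
        "status": "Pending",
        "retry_count": 0,
        "created_at": datetime.now(timezone.utc),
    })

def pending_for_retry(limit: int = 100):
    """Records the CronJob scheduler would pick up on its next run."""
    return list(failed.find({"status": "Pending"}).limit(limit))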

DevOps Setup

  • GitHub setup is not part of the cloud services and needs to be done separately
  • Jenkins comes by default with ROKS, so you can quickly develop pipelines with the stages you need using Groovy scripts. I added unit tests, test coverage, static code analysis, build, and publish-to-image-repository stages in the CI pipeline
  • Separate test and production environments take duplicated effort
  • YAML files for shared configuration, or a Vault setup to manage secrets

Deployment Setup

The deployment task was set up as part of the CD pipeline in Jenkins. Each microservice's YAML is created with 200m CPU and 256Mi memory limits and deployed with a maximum of 3 replicas based on 70% CPU utilization. The scheduler is created as a CronJob and scheduled to run at specific intervals to check and process error records.


Conclusion

The details shared above are at a high level; the low-level details would define the exact effort required. My intention is to share this as quick reference material, as this was the easiest middleware development experience I have had using IBM public cloud services.



Tuesday, October 19, 2021

Data Migration - AI Usecase

Text Summarization During Data Migration - ML Usecase 


When data needs to be migrated from one system to another, a common challenge the industry faces is data length incompatibility between the source and target systems. This surfaces when the target system cannot accept a larger data size for fear of impacting its existing setup and usage. As modification is not an option, teams usually end up truncating the data when the source system's data size is larger than the target system's. Truncation is the easy option, but it comes at the cost of losing precious information, especially when the data relates to sales, customers, or financials.

With AI/ML advances in natural language processing, this challenge can be countered with various summarization techniques.

My Usecase for Summarization

A few data attributes in the source system had a length of 500 characters, while the target system imposed a 255-character constraint on those attributes. The migration team had two options: either truncate data longer than 255 characters, or meaningfully reduce the text below 255 characters without losing the context and information.

The words used in those attributes are mostly dates, acronyms, short-form words, and links, apart from the usual nouns, verbs, adjectives, and other parts of speech, the kind of notes users jot down for follow-ups or as reminders.

AI Summarization Techniques

There are many third-party services that provide text summarization. I explored IBM Watson, Lexalytics, MeaningCloud, etc. to check their accuracy, but these services did not show the accuracy that we wanted.

Based on the output type, summarization can be done in the following ways.

Extractive Summarization: The top N sentences are extracted to provide a subset of the information.

Abstractive Summarization: Key points are generated in a concise form without losing the context.

Mixed: An abstractive summary is produced after identifying an extractive intermediate state.

Design

Low accuracy with Extractive approach

For our use case, the approaches and algorithms below did not return a concise summary without losing context. I noticed either context loss or character counts that were not reduced below 255.

  • TextRank: The most frequently occurring words are ranked high
  • Sumy's LexRank summarizer: A sentence similar to most other sentences is ranked high in this algorithm
  • Sumy's KL-Sum: Word distribution similarity is the basis for sentence selection

Higher accuracy with BERT - Abstractive approach 

Google's BERT (Bidirectional Encoder Representations from Transformers) is a deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus.
 

Technical Stack Used

PyTorch: Used to implement the BERT model

Sentence Transformers: Used to get sentence embeddings

Lang Detect: Library to identify languages (used here to detect English only)

Pandas: Data frame library

Regex: Used for pattern matching

NLTK: The Natural Language Toolkit is a popular text processing library used for tasks such as stemming, lemmatization, and tokenization, with methods to process text using statistical techniques

Dateparser: For parsing dates


Solution Overview

Preparing data for ML

The input text encountered was pretty diverse and had irregular patterns. This inconsistency posed a big challenge, and the data had to be preprocessed before applying the ML model to it. Here is the list of such inconsistencies:
  • Different date formats
  • Unfinished sentences
  • Parentheses
  • Misspelled words
  • Use of non-homogeneous, domain-specific abbreviations
  • Chat-style patterns
  • Acronyms
  • Chronologically ordered data in some cases
  • Non-English language

Data Cleanup Process

  • Language detection is applied and only English-language text is selected
  • Links in the text are reduced to tiny URLs
  • Parentheses and the text within them are removed
  • Date formats are converted to numeric form to reduce the character count
Using the NLTK library, the following processing is applied to the input text (a minimal sketch follows this list):
  • Tokenization - a very important task in NLP where sentences are broken down semantically into words
  • Lemmatization - the process of transforming a word to its root form (e.g. going - go, completed - complete)
  • Punctuation removal - punctuation marks in the text are removed
  • Stop word removal - stop words such as of, such, during, should, etc. are removed from the text
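
A minimal sketch of these NLTK steps, assuming the required corpora (punkt, wordnet, stopwords) have been downloaded; it is illustrative, not the exact pipeline used.

import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Assumes: nltk.download("punkt"), nltk.download("wordnet"), nltk.download("stopwords")
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def clean(text: str) -> list:
    tokens = word_tokenize(text.lower())                         # tokenization
    tokens = [t for t in tokens if t not in string.punctuation]  # punctuation removal
    tokens = [t for t in tokens if t not in stop_words]          # stop word removal
    return [lemmatizer.lemmatize(t) for t in tokens]             # lemmatization

print(clean("Sent a follow up email to the customer, going to close this next week."))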
Using regex,
  • N-grams (phrases with N words) are mined from the dataset, and common phrases are identified and replaced with an alternate, shorter representation.
A sample N-grams list, composed and mapped to shorter versions:

N_grams = {
'sent a follow up email':'mailed',
'a follow up email to':'mailed',
'sent an email notification to':'mailed',
'to see if we can':'maybe',
'sent a closure email to':'closure mailed',
'internal process of approval final':'approval',
.....
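
Applying such a mapping is straightforward; the sketch below is an illustrative reduction pass over the text using an N_grams dictionary like the sample above.

import re

# Illustrative phrase-reduction pass using a mapping like the N_grams sample above.
N_grams = {
    'sent a follow up email': 'mailed',
    'to see if we can': 'maybe',
}

def reduce_phrases(text: str, mapping: dict) -> str:
    for phrase, short in mapping.items():
        # \b keeps the replacement anchored to whole phrases, case-insensitively
        text = re.sub(r'\b' + re.escape(phrase) + r'\b', short, text, flags=re.IGNORECASE)
    return text

print(reduce_phrases("Sent a follow up email to the vendor to see if we can expedite.", N_grams))
# -> "mailed to the vendor maybe expedite."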

ML Model

The parts-of-speech-based summarization was able to reduce the character count to under 255 for roughly 80% of the records. As deleting sentences is the last resort in this implementation, a word-reducing mechanism was applied within sentences using the BERT approach.

With BERT pre-trained on a corpus of 2.5 billion words, it becomes easy to extend without worrying about finding training data. The neural network model is fine-tuned for this use case by training on the text from the previous stage. The training data consists of sentence pairs with a similarity score, and the model learns which words in either sentence contribute to that score. This effectively helps the model produce a set of candidates that are similar in context to a given input sentence.

The dates in the text are removed before passing it to the model, as they are treated as essential; once the model produces the output, the dates are added back to the text. Before being given to the model, the sentences in the text are first broken down into N-grams according to a length-based table (length of text, N-gram range, Top N). For example, if a sentence has fewer than 20 characters, it is broken down into N-grams of 1 or 2 words (the N-gram range), and only one such N-gram (Top N) is picked to represent the sentence. Similarly, N-grams are made for each sentence, and the Top-N N-grams are picked for the final summary.

All the N-grams are passed to the BERT model, which encodes each N-gram into a candidate embedding (the N-gram/phrase is represented as a vector). The original sentence is also encoded, which is called the document encoding. These embeddings are passed, along with the Top-N value and the diversity of the final summary (fixed at 0.5), to Maximal Marginal Relevance (MMR) [10]. MMR finds the top-N candidates that best represent the given sentence, and these candidates/phrases are added to the summary in place of that input sentence. The dates are added back by aligning the BERT-summarized text with the original text.
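
A minimal sketch of this encode-and-select step using sentence-transformers; the model name and the simplified MMR loop are assumptions for illustration, not the exact fine-tuned model from this project.

import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative embedding + MMR selection (model name and loop are assumptions).
model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr_select(sentence: str, candidates: list, top_n: int, diversity: float = 0.5) -> list:
    doc_emb = model.encode(sentence)
    cand_embs = model.encode(candidates)
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < top_n:
        best, best_score = None, -1e9
        for i in remaining:
            relevance = cosine(cand_embs[i], doc_emb)
            redundancy = max((cosine(cand_embs[i], cand_embs[j]) for j in selected), default=0.0)
            score = (1 - diversity) * relevance - diversity * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return [candidates[i] for i in selected]

print(mmr_select("sent a closure email to the customer after final approval",
                 ["closure email customer", "final approval", "sent a", "customer after"],
                 top_n=2))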

Result

With the BERT model, around 92% of records were reduced to fewer than 255 characters. For a test data set of roughly 1000 records (997 usable):

  • Rows reduced to under 255 characters: 919/997
  • Rows between 255 and 265 characters: 42/997
  • Rows between 265 and 275 characters: 9/997
  • Rows between 275 and 285 characters: 27/997









Monday, April 12, 2021

Prometheus and Grafana in IBM Cloud Openshift - System requirements

 Introduction

Prometheus is a popular open source monitoring system, and Grafana is an open source tool that complements it on the visualization side. Together, these two tools help users understand complex data through the metrics of any containerized system. This combination is also a very popular and common monitoring stack used by DevOps teams.

Prometheus 

Prometheus is a system to collect and process metrics, not an event logging system. The main Prometheus server runs standalone and has no external dependencies. It collects metrics, stores them, makes them available for querying, and sends alerts based on the metrics collected. The details provided here were tested with version 2.20 and higher and may not be applicable to earlier versions.

Prometheus Concepts

To plan sizing requirements for Prometheus, the concepts below need to be understood first.

Time series are streams of timestamped values belonging to the same metric and the same set of labeled dimensions. Besides stored time series, Prometheus may generate temporary derived time series as the result of queries. Prometheus can handle millions of time series, and memory usage is directly proportional to the time series count. A time series is represented as a series of chunks, which ultimately end up in time series files on disk.

prometheus_tsdb_head_series: The current number of series held in memory (in the old 1.x storage this was prometheus_local_storage_memory_series)

Scrape: Prometheus is a pull-based system. To fetch metrics, Prometheus sends an HTTP request called a scrape. It sends scrapes to targets based on its configuration.

Metrics & Labels: Every time series is uniquely identified by its metric name and optional key-value pairs called labels. The metric name specifies the general feature of the system being measured (e.g. http_requests_total, the total number of HTTP requests received). The four metric types are Counter, Gauge, Histogram, and Summary.

Labels enable Prometheus's dimensional data model: any given combination of labels for the same metric name identifies a particular dimensional instantiation of that metric (for example: all HTTP requests that used the method POST to the /api/tracks handler).

Samples form the actual time series data. Each sample consists of a float64 value and a millisecond-precision timestamp

Instance/Target & Job: In Prometheus terms, an endpoint you can scrape is called an instance, usually corresponding to a single process. A collection of instances with the same purpose, a process replicated for scalability or reliability for example, is called a job.

Capacity planning exercise for Prometheus

Planning for sizing predominantly covers memory usage, disk usage, and CPU usage.

Memory usage: There are two parts to memory usage, ingestion and query. Both need to be considered in capacity planning for Prometheus.

Data ingestion: The memory requirement depends on the number of time series, the number of labels you have, and your scrape frequency, in addition to the raw ingest rate. Finally, add roughly 50% headroom on top of this for garbage collection overhead.

Query: It is important to consider the concurrency and the complexity of the customized queries used to pull data out of Prometheus.

I found an online capacity planning calculator helpful in validating these requirements.

Disk Usage

The Prometheus server stores metrics in a local folder for a period of 15 days by default. Any production-ready deployment requires you to configure a persistent storage interface that can maintain historical metrics data and survive pod restarts.

Prometheus stores its on-disk time series data under the directory specified by the --storage.tsdb.path flag (the default path is ./data). The --storage.tsdb.retention.time flag allows you to configure the retention time for samples.

The rule of thumb that Prometheus recommends for determining the disk requirement is:

needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample

For example, for 15 days of storage: 1,296,000 (seconds) * 10,000 (samples/second) * 1.3 (bytes/sample) = 16,848,000,000 bytes, which is approximately 16 gigabytes.
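
A small sketch of the same calculation, handy for plugging in your own retention, ingest rate, and bytes-per-sample assumptions:

# Disk sizing sketch using the rule of thumb above; inputs are example assumptions.
def needed_disk_space(retention_days: float,
                      ingested_samples_per_second: float,
                      bytes_per_sample: float = 1.3) -> float:
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * ingested_samples_per_second * bytes_per_sample

bytes_needed = needed_disk_space(retention_days=15, ingested_samples_per_second=10_000)
print(f"{bytes_needed:,.0f} bytes ~= {bytes_needed / 1e9:.1f} GB")   # 16,848,000,000 bytes ~= 16.8 GB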

To lower the rate of ingested samples, you can either reduce the number of time series you scrape (fewer targets or fewer series per target), or you can increase the scrape interval. However, reducing the number of series is likely more effective, due to compression of samples within a series.

More details on Prometheus storage can be found in the official documentation.

Scale out

There are in fact various ways to scale and federate Prometheus. One architecture is to have multiple sharded Prometheus servers, each scraping a subset of the targets and aggregating them within the shard. A leader then federates the aggregates produced by the shards and rolls them up to the job level.

An interesting read on scale-out is linked here for further information.

Grafana

Grafana's requirements are simple: it needs a minimum of 255 MB of RAM and a single core. You might need a little more RAM if your requirements include:
  • Server-side rendering of images
  • Alerting
  • Data source proxying

The bottleneck for Grafana performance is the time series database backend when handling complex queries. By default, Grafana ships with SQLite, an embedded database stored in the Grafana installation location.


Wednesday, May 29, 2019

Error Handling in Application Integration



Application integration does not always follow the happy path, irrespective of domain and the mechanism adopted. More and more enterprises are adopting microservices, which calls for integrating various applications in real time. This integration does not always result in successful data flow, for various reasons including business, application, and network errors or limitations. Handling such failures in real time is critical to the functioning of the systems and to fulfilling the assured-delivery requirement. Understanding the types of errors/failures, and then processing, retrying, and transforming messages accordingly, provides a higher rate of success.

The areas or activities that need to be looked into to solve this problem are listed below:
  • Identifying the errors
  • Defining error categories
  • Formulating recoverable and non recoverable errors
  • Defining workflow steps - Automated or Manual 
  • Retry mechanism for recoverable errors
  • Persistence logic for long term recovery
  • Message reconstruction process
  • Defining manual intervention process

Usecase with REST API integration

Error Identification and Categorization


400 category: client/user input errors (system related)
  • 401 - Authentication issue: retry after getting a fresh token. If it still fails, send an alert.
  • 404 - Retry logic required
  • 403 - Maybe one retry and then an alert

500 category - internal server errors (business validation related)

  • For these messages, it is important to figure out all the error codes from the target systems.
  • It is important to find out whether the error codes are common or specific to each type of data invalidation in each spoke. We have to start from the codes alone. A minimal mapping of codes to handling policies is sketched below.
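
The sketch maps HTTP status codes to a retry/alert policy; the policy values are illustrative assumptions based on the lists above.

# Illustrative mapping of HTTP status codes to handling policy (values are assumptions).
ERROR_POLICY = {
    401: {"recoverable": True,  "action": "refresh_token_and_retry", "max_retries": 1, "alert": True},
    403: {"recoverable": True,  "action": "retry_once",              "max_retries": 1, "alert": True},
    404: {"recoverable": True,  "action": "retry_with_backoff",      "max_retries": 3, "alert": False},
    500: {"recoverable": False, "action": "persist_for_manual_fix",  "max_retries": 0, "alert": True},
}

def classify(status_code: int) -> dict:
    """Fall back to manual handling for any code we have not categorized yet."""
    return ERROR_POLICY.get(status_code,
                            {"recoverable": False, "action": "persist_for_manual_fix",
                             "max_retries": 0, "alert": True})

print(classify(401)["action"])   # refresh_token_and_retry
print(classify(502)["action"])   # persist_for_manual_fix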


Recoverable and non recoverable errors

Errors such as temporary network failures or application maintenance downtime can be recoverable. Data validation errors require transformation, either automatic or manual depending on the complexity of the business rules. The retry logic should be designed in such a way that it handles these cases individually.

Retry mechanism for recoverable errors

The picture below illustrates the short-term and long-term retry logic.



Circuit breaker design

A simple circuit breaker is used with the short-term retry to avoid making the external call while the circuit is open, and the breaker itself should detect when the underlying calls are working again. We can implement this self-resetting behaviour by trying the remote call again after a suitable interval and resetting the breaker if it succeeds. This also contains the impact of unexpected failures of remote calls.
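
A minimal, illustrative circuit breaker along these lines (thresholds and timings are assumptions):

import time

# Minimal self-resetting circuit breaker sketch (thresholds/timings are assumptions).
class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None          # None means the circuit is closed

    def call(self, remote_call, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open - skipping remote call")
            self.opened_at = None      # half-open: allow one trial call
        try:
            result = remote_call(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0              # success resets the breaker
        return result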


    



Persistence logic for long term recovery

The error, along with the message, needs to be stored temporarily in order to resend the message after the correction has been made to the integration flow. The data model should include the fields below (a minimal sketch of the record follows the lists):

Configurable fields for category definition:
1. Retry attempts - number
2. Frequency/duration - number, to store minutes
3. Alert required - varchar, to store the alert email
4. Manual flag - int: Manual, Auto, None
5. Priority - int: High, Medium, Low

These are additional message fields:
1. Status - varchar or int to store Pending, Completed, On Going (Message table)
2. Unique ID (trace ID) - required to identify the error apart from the message object so that each message can be treated differently. This has to be the primary key. (Message table)
3. The ID column has to be the oppty/LI ID.
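
A minimal sketch of that record as it might look in code; the field names mirror the lists above, and the defaults are assumptions.

from dataclasses import dataclass, field
from datetime import datetime

# Illustrative error/message record mirroring the data model above (defaults are assumptions).
@dataclass
class FailedMessage:
    trace_id: str                    # unique ID, primary key
    entity_id: str                   # the oppty/LI ID the message relates to
    payload: dict                    # original message, kept for reconstruction
    retry_attempts: int = 3          # configurable per error category
    frequency_minutes: int = 15      # gap between retries
    alert_email: str = ""            # where to send alerts, if required
    manual_flag: int = 0             # 0 = Auto, 1 = Manual, 2 = None
    priority: int = 1                # 0 = High, 1 = Medium, 2 = Low
    status: str = "Pending"          # Pending, Completed, On Going
    created_at: datetime = field(default_factory=datetime.utcnow)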

Defining manual intervention process

There are errors that cannot be resolved automatically by program logic; this kind of error calls for manual intervention. For example, if the target system expects an alphanumeric value for a certain mandatory field and the source system sends a numeric value, the synchronization fails. Unless the value is changed to meet the target system's expectations, the flow cannot succeed. User intervention is required most of the time to correct the data or the system. Every such case needs to be identified, the processes need to be well defined, and the messages handled accordingly before resending.

Conclusion

A few other areas that need to be included are a scheduler for triggering the long-term retry, the message reconstruction process, and an alerting/notification mechanism. With these pieces in place, it becomes easy and manageable to handle both expected and unexpected error scenarios.
Application integration is never complete without a robust error handling process, and the benefit of having one is enormous. Happy to assist further and take your feedback, especially any improvements you can think of.