Tuesday, November 16, 2021

EAI Tool Building With IBM Public Cloud Services

There are plenty of EAI middleware tools on the market, so why build a new one? The answer is always debatable, with reasons cited such as complete control over the tool, extensibility, customization for your own unique use cases, and of course lower cost. The cost argument becomes even more tempting when cheap cloud infrastructure, varying loads, and fast-changing business dynamics support the justification.

If you are convinced that low cost is what you are after, don't forget to check the important features any EAI tool should provide and compare them against your requirements. Every popular tool on the market offers the basic features: minimal coding, quick mapping, real-time/batch mode, multiple connectors, a retry mechanism, error handling, monitoring, and reports. Some modern tools also provide niche features such as AI-based data mapping or integration. If you only need the basic features and don't want to end up learning yet another tool, it's time to try IBM Public Cloud (IPC) services for developing your own EAI tool.

A microservices-based, containerized cloud architecture provides the mechanism for building such a tool quickly and efficiently, and IBM Public Cloud has multiple offerings to support it. There is a Cloud Pak that comes with perfectly packaged services to quick-start the journey. On the other hand, there are individual services, from compute clusters to databases, that you can cherry-pick to build your own infrastructure package.

I want to share my experience of setting up my own services and how easy it was. To start with, I wanted all the basic features mentioned above in the tool. A few requirements I focused on are:

  • Integration options covering REST API, Kafka, and DB integration, so there is a clear need for multiple connectors
  • Data sync interval: real time is the requirement in my case
  • Exclusive data mapping, since heterogeneous applications need to be integrated
  • Assured delivery: can't miss even a single transaction

The above requirements clearly point to the kind of system that needs to be developed, which includes:

  • API server to listen to the calling application - a microservice with a lightweight server needs to be developed here
  • Independent connectors for the API, Kafka, and DB of the applications - utility backend microservices providing connection and retry logic
  • Data processor - yet another microservice that processes the data and prepares the mapping for a specific target system
  • A database that can persist data temporarily in case of connection issues
Techno-Functional overview


Development Effort

Here is the effort required to set up the services that fulfill the above requirements.

IPC infra overview

Minimal IPC Services required:
  • IKS cluster with 3 nodes and a persistent volume (PV) - to develop the microservices
  • MongoDB - NoSQL DB for temporary persistence
  • LogDNA - Log analysis for debugging purpose

Developing Microservices

Setting up Red Hat OpenShift on IBM Cloud (ROKS) or IBM Kubernetes Service (IKS) to build cloud-native containerized microservices takes no more than 20 minutes. This includes operational tasks such as creating separate resources for the test and production environments.

The technology used is Python with Flask. The connectors are the main services and take the longest, since they require end-to-end testing and retry mechanisms to handle errors across the different integration approaches.
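
To illustrate the API server piece, here is a minimal Flask sketch. The route, payload shape, processor URL, and response codes are assumptions for illustration only, not the actual implementation; they simply show how a lightweight listener could hand incoming records to the data processor service.

# Minimal Flask API server sketch (illustrative only).
# The route, payload fields, and downstream URL are assumptions.
from flask import Flask, request, jsonify
import requests

app = Flask(__name__)

# Hypothetical internal endpoint of the data processor microservice
PROCESSOR_URL = "http://data-processor:8080/process"

@app.route("/integrations/events", methods=["POST"])
def receive_event():
    payload = request.get_json(silent=True)
    if not payload:
        return jsonify({"error": "invalid or missing JSON body"}), 400
    try:
        # Hand the record over to the data processor service
        resp = requests.post(PROCESSOR_URL, json=payload, timeout=5)
        resp.raise_for_status()
    except requests.RequestException:
        # In the full tool the failed record would be persisted for the
        # retry CronJob; here we simply acknowledge it for later retry
        return jsonify({"status": "queued_for_retry"}), 202
    return jsonify({"status": "delivered"}), 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)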

Microservices created:

  • Three different connectors for DB, Kafka and REST API
  • Auth server for OAuth2.0
  • API server to receive requests
  • Data processor that handles mapping, transforming and orchestration
A CronJob scheduler is created to handle error retries at regular intervals.

MongoDB Collection Setup and LogDNA Setup

MongoDB is a quick setup with a simple collection to store failed records for later retries. LogDNA for log analysis is also a fairly easy setup to complete before you start coding and adding log statements.
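
Since the failed-records collection is central to the assured-delivery requirement, here is a hedged pymongo sketch of how such a collection might be written to and read by the retry CronJob. The connection string, database, collection, and field names are all hypothetical.

# Sketch of the temporary-persistence pattern (illustrative).
# Connection string, database/collection names, and field names are assumptions.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://mongodb:27017")  # hypothetical service URL
failed_records = client["eai"]["failed_records"]

def queue_failed_record(payload, target, error):
    """Persist a record that could not be delivered, for the retry CronJob."""
    failed_records.insert_one({
        "payload": payload,
        "target": target,
        "error": str(error),
        "status": "failed",
        "queued_at": datetime.now(timezone.utc),
    })

def fetch_records_for_retry(limit=100):
    """Called by the scheduled retry job to pick up pending failures."""
    return list(failed_records.find({"status": "failed"}).limit(limit))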

DevOps Setup

  • GitHub setup is not part of the cloud services and needs to be done separately
  • Jenkins comes by default with ROKS, so you can quickly develop pipelines with the stages you need using Groovy scripts. I added unit tests, test coverage, static code analysis, build, and publish-to-image-repository stages in the CI pipeline
  • Separate test and production environments take duplicated effort
  • YAML files for shared configuration, or a Vault setup to manage secrets

Deployment Setup

The deployment task was set up as part of the CD pipeline in Jenkins. Each microservice's YAML is created with 200m CPU and 256Mi memory limits and deployed with a maximum of 3 replicas, autoscaling at 70% CPU utilization. The scheduler is created as a CronJob and scheduled to run at specific intervals to check and process error records.


Conclusion

The details shared above are at a high level, and the low-level details would define the exact effort required. My intention is to share this as quick reference material, since this was the easiest middleware development experience I have had, thanks to IBM Public Cloud services.



Tuesday, October 19, 2021

Data Migration - AI Usecase

Text Summarization During Data Migration - ML Usecase 


When data needs to be migrated from one system to another, a common challenge the industry faces is dealing with data-length incompatibility between the source and target systems. This challenge surfaces when the target system cannot accept the new data size for fear of impacting its existing setup and usage. As modifying the target is not an option, teams usually end up truncating the data when the source data is larger than the target allows. Truncation is an easy option, but it comes at the cost of losing precious information, especially when it relates to sales, customer, or financial data.

With AI/ML advances in natural language processing, this challenge can be countered with various summarization techniques.

My Usecase for Summarization

A few data attributes in the source system had a 500-character data length, while the target system imposed a 255-character constraint on those attributes. The migration team had two options: either truncate data longer than 255 characters, or meaningfully reduce the text below 255 characters without losing context and information.

The words used in those attributes were mostly dates, acronyms, short-form words, and links, alongside the usual nouns, verbs, adjectives, and other parts of speech, in notes that users jot down for follow-ups or as reminders.

AI Summarization Techniques

There are many third-party services that provide text summarization. We explored IBM Watson, Lexalytics, MeaningCloud, and others to check their accuracy, but these services did not show the accuracy we wanted.

Based on output type, summarization can be done in the following ways:

Extractive summarization: the top N sentences are extracted to provide a subset of the information.

Abstractive summarization: key points are generated in concise form without losing the context.

Mixed: an abstractive summary is produced after identifying an extractive intermediate state.

Design

Low accuracy with Extractive approach

For our use case described above, the approaches and algorithms below did not return a concise summary without losing context. I noticed either context loss or text that was not reduced below 255 characters. (A usage sketch of the Sumy summarizers follows this list.)

  • TextRank: the most frequently occurring words are ranked high
  • Sumy's LexRank summarizer: a sentence similar to most other sentences is ranked high in this algorithm
  • Sumy's KL summarizer: word-distribution similarity is the basis for sentence selection
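
For reference, the LexRank experiment could be reproduced roughly as below; the sample text and the sentence count are placeholders, and Sumy also expects NLTK's punkt tokenizer data to be available.

# Reproducing the Sumy LexRank experiment (illustrative sketch).
# Requires: pip install sumy; NLTK punkt data must be downloaded.
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

text = ("Sent a follow up email to the vendor regarding the open invoice. "
        "The vendor confirmed the payment will be processed next week. "
        "A closure email will be sent once the payment is received.")

parser = PlaintextParser.from_string(text, Tokenizer("english"))
summary = LexRankSummarizer()(parser.document, sentences_count=2)  # keep top 2 sentences
print(" ".join(str(sentence) for sentence in summary))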

Higher accuracy with BERT - Abstractive approach 

Google's BERT (Bidirectional Encoder Representations from Transformers) is a deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus.
 

Technical Stack Used

PyTorch: used to implement the BERT-based model

Sentence Transformers: used to get sentence embeddings

langdetect: library to identify languages (used here to keep English text only)

Pandas: data frame library used to handle the dataset

Regex: used for pattern matching and extraction

NLTK: the Natural Language Toolkit, a popular text-processing library used for stemming, lemmatization, tokenization, and other statistical text-processing methods

dateparser: used for parsing dates


Solution Overview

Preparing data for ML

The input text encountered was quite diverse and irregular in its patterns. This inconsistency posed a big challenge, and the data had to be processed before an ML model could be applied to it. Here is a list of the inconsistencies:
  • Different date formats
  • Unfinished sentences
  • Parentheses
  • Misspelled words
  • Non-homogeneous, domain-specific abbreviations
  • Chat-style text
  • Acronyms
  • Chronologically ordered data in some cases
  • Non-English language

Data Cleanup Process

  • Language detection is applied and only English-language text is selected
  • Links in the text are reduced to tiny URLs
  • Parentheses and the text within them are removed
  • Date formats are converted to numeric form to reduce the character count
Using the NLTK library, the following processing is applied to the input text (a minimal sketch follows this list):
  • Tokenization – a very important NLP task in which sentences are broken down semantically into words
  • Lemmatization – the process of transforming a word to its root form (e.g., going – go, completed – complete)
  • Punctuation removal – punctuation marks in the text are removed
  • Stop-word removal – stop words such as of, such, during, should, etc. are removed from the text
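
A minimal sketch of this cleanup, assuming the standard NLTK corpora (punkt, stopwords, wordnet) are downloaded; the langdetect filter from the first step is included, and the function and variable names are illustrative.

# Minimal text cleanup sketch (illustrative, not the production pipeline).
# Requires: nltk.download("punkt"), nltk.download("stopwords"), nltk.download("wordnet")
import string
from langdetect import detect
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def clean_text(text):
    if detect(text) != "en":                                      # keep English only
        return None
    tokens = word_tokenize(text)                                  # tokenization
    tokens = [lemmatizer.lemmatize(t.lower()) for t in tokens]    # lemmatization (default noun POS)
    tokens = [t for t in tokens if t not in string.punctuation]   # punctuation removal
    tokens = [t for t in tokens if t not in stop_words]           # stop-word removal
    return " ".join(tokens)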
Using regular expressions:
  • N-grams (phrases with N words) are mined from the dataset, common phrases are identified, and these are replaced with a shorter alternate representation (see the sketch after the sample list below).
Sample N-grams list composed and mapped to shorter version:

N_grams = {
    'sent a follow up email': 'mailed',
    'a follow up email to': 'mailed',
    'sent an email notification to': 'mailed',
    'to see if we can': 'maybe',
    'sent a closure email to': 'closure mailed',
    'internal process of approval final': 'approval',
    # ... (more mappings elided)
}
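
Applying such a mapping could look like the sketch below, where each known phrase is replaced by its shorter form in a single regex pass; the helper name is hypothetical.

import re

def shorten_phrases(text, n_grams):
    """Replace known multi-word phrases with their shorter mapped forms."""
    for phrase, short in n_grams.items():
        text = re.sub(re.escape(phrase), short, text, flags=re.IGNORECASE)
    return text

# Example: shorten_phrases("Sent a follow up email to the vendor", N_grams)
# -> "mailed to the vendor"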

ML Model

The parts-of-speech-based summarization reduced the character count to under 255 for roughly 80% of records. Since deleting sentences is the last resort in this implementation, a word-reduction mechanism was applied within sentences using the BERT-based approach.

With BERT pre-trained on a 2.5-billion-word corpus, it is easy to extend without worrying about finding training data. The neural network model was fine-tuned for this use case by training it on the text from the last cleanup stage. The training data consists of sentence pairs with a similarity score, and the model learns which words in either sentence contribute to that score. This learning effectively helps the model produce a set of candidates that are similar in context to a given input sentence.

The dates in the text are removed before it is passed to the model, as they are treated as essential, and once the model produces its output the dates are added back to the text. Before being given to the model, the sentences in the text are first broken down into N-grams according to a length-based table. For example, if a sentence has fewer than 20 characters (the text-length column), it is broken into N-grams of 1 or 2 words (the N-gram range column), and only 1 such N-gram (the Top-N column) can be picked to represent the sentence. N-grams are made for each sentence in the same way, and the Top-N N-grams are picked for the final summary.
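
The N-gram breakdown step could be sketched as below; the default range of 1 to 2 words corresponds to the shortest-sentence row described above, while the function name and example sentence are placeholders.

# Break a sentence into word N-grams within a given range (illustrative).
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

def candidate_ngrams(sentence, ngram_range=(1, 2)):
    """Return all N-gram phrases of the sentence for the given range."""
    tokens = word_tokenize(sentence)
    candidates = []
    for n in range(ngram_range[0], ngram_range[1] + 1):
        candidates.extend(" ".join(gram) for gram in ngrams(tokens, n))
    return candidates

# Example: a short sentence gets 1- and 2-word candidates
print(candidate_ngrams("issue closed today"))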

All the N-grams are passed to the BERT model, which encodes each N-gram into a candidate embedding (the N-gram/phrase is represented as a vector). The original sentence is also encoded, which is called the document embedding. These embeddings are passed, along with the Top-N value and the diversity of the final summary (fixed at 0.5), to Maximal Marginal Relevance (MMR) [10]. MMR finds the top-N candidates that best represent the given sentence, and these candidates/phrases are added to the summary in place of the input sentence. The dates are then added back by aligning the BERT-summarized text with the original text.
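
The embedding-plus-MMR step can be sketched roughly as follows. The pretrained model name, the MMR implementation details, and the example inputs are assumptions for illustration; the actual fine-tuned model used in the project is not shown here.

# Candidate selection with sentence embeddings and MMR (illustrative sketch).
# Assumes: pip install sentence-transformers scikit-learn numpy
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder, not the fine-tuned model

def mmr(doc_embedding, candidate_embeddings, candidates, top_n=3, diversity=0.5):
    """Pick top_n candidates that are relevant to the sentence yet diverse."""
    doc_sim = cosine_similarity(candidate_embeddings, doc_embedding.reshape(1, -1))
    cand_sim = cosine_similarity(candidate_embeddings)
    selected = [int(np.argmax(doc_sim))]
    remaining = [i for i in range(len(candidates)) if i not in selected]
    while len(selected) < min(top_n, len(candidates)):
        relevance = doc_sim[remaining, 0]
        redundancy = np.max(cand_sim[np.ix_(remaining, selected)], axis=1)
        scores = (1 - diversity) * relevance - diversity * redundancy
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return [candidates[i] for i in selected]

sentence = "sent a follow up email to the vendor to see if we can close the ticket"
candidates = ["follow up email", "close the ticket", "vendor", "see if we can"]
doc_emb = model.encode(sentence)
cand_embs = model.encode(candidates)
print(mmr(doc_emb, cand_embs, candidates, top_n=2, diversity=0.5))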

Result

With the BERT-based model, around 92% of records were reduced to fewer than 255 characters. For a test data set of 1,000 records, the character-count distribution across the 997 processed records was:

  • Fewer than 255 characters: 919/997
  • 255–264 characters: 42/997
  • 265–274 characters: 9/997
  • 275–284 characters: 27/997









Monday, April 12, 2021

Prometheus and Grafana in IBM Cloud Openshift - System requirements

 Introduction

Prometheus is a popular open-source monitoring system, and Grafana is an open-source tool that complements it on the visualization side. Together, the two help users understand complex containerized systems through their metrics, and this combination is one of the most popular monitoring stacks used by DevOps teams.

Prometheus 

Prometheus is a system for collecting and processing metrics, not an event-logging system. The main Prometheus server runs standalone and has no external dependencies. It collects metrics, stores them, makes them available for querying, and sends alerts based on the collected metrics. The details provided here were tested with version 2.20 and higher and may not apply to earlier versions.

Prometheus Concepts

To plan and exercise sizing requirements for Prometheus, the concepts below need to be understood first.

Time series are streams of timestamped values belonging to the same metric and the same set of labeled dimensions. Besides stored time series, Prometheus may generate temporary derived time series as the result of queries. Prometheus can handle millions of time series, and memory usage is directly proportional to the time series count. A time series is represented as a series of chunks, which ultimately end up in a time series file (one file per time series) on disk.

prometheus_tsdb_head_series (prometheus_local_storage_memory_series in Prometheus 1.x): the current number of series held in memory

Scrape: Prometheus is a pull-based system. To fetch metrics, Prometheus sends an HTTP request called a scrape. It sends scrapes to targets based on its configuration.

Metrics & Labels: Every time series is uniquely identified by its metric name and optional key-value pairs called labels. The metric name specifies the general feature of the system being measured (e.g. http_requests_total, the total number of HTTP requests received). The four metric types are Counter, Gauge, Histogram, and Summary.

Labels enable Prometheus's dimensional data model: any given combination of labels for the same metric name identifies a particular dimensional instantiation of that metric (for example, all HTTP requests that used the method POST against the /api/tracks handler).
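
To make the metric-and-label model concrete, here is a small sketch using the official prometheus_client Python library; the port, request loop, and label values simply echo the http_requests_total example above and are placeholders.

# Exposing a labeled counter for Prometheus to scrape (illustrative sketch).
import random
import time
from prometheus_client import Counter, start_http_server

# Metric name + labels identify each time series, e.g.
# http_requests_total{method="POST", handler="/api/tracks"}
REQUESTS = Counter(
    "http_requests_total",
    "Total number of HTTP requests received",
    ["method", "handler"],
)

if __name__ == "__main__":
    start_http_server(8000)  # metrics served at http://localhost:8000/metrics
    while True:
        REQUESTS.labels(method="POST", handler="/api/tracks").inc()
        time.sleep(random.uniform(0.1, 1.0))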

Samples form the actual time series data. Each sample consists of a float64 value and a millisecond-precision timestamp.

Instance/Target & Job: In Prometheus terms, an endpoint you can scrape is called an instance, usually corresponding to a single process. A collection of instances with the same purpose, a process replicated for scalability or reliability for example, is called a job.

Capacity planning exercise for Prometheus

Planning for sizing predominantly covers memory usage, disk usage, and CPU usage.

Memory usage: there are two parts to memory usage, ingestion and query, and both need to be considered in capacity planning for Prometheus.

Data ingestion: the memory requirement depends on the number of time series, the number of labels, and the scrape frequency, in addition to the raw ingest rate. Finally, this capacity needs to be increased by about 50% to account for garbage collection overhead.

Query: it is important to consider the concurrency and the complexity of the customized queries used to read data back from Prometheus.

I found this online capacity planning calculator helpful in validating my requirements.

Disk Usage

The Prometheus server stores metrics in a local folder for a period of 15 days by default. Any production-ready deployment requires you to configure a persistent storage interface that can maintain historical metric data and survive pod restarts.

Prometheus 2.x stores its on-disk time series data under the directory specified by the flag --storage.tsdb.path (the default is ./data). The flag --storage.tsdb.retention.time allows you to configure the retention time for samples.

The rule of thumb Prometheus recommends for determining the disk requirement is:

needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample

For example, for 15 days of storage: 1,296,000 seconds * 10,000 samples/second * 1.3 bytes/sample = 16,848,000,000 bytes, which is approximately 16 gigabytes.
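
For quick validation, the same rule of thumb can be wrapped in a few lines of Python; the defaults below are just the example figures above, and the 1.3 bytes/sample average should be replaced with your own measured value.

def needed_disk_space_bytes(retention_days=15,
                            ingested_samples_per_second=10_000,
                            bytes_per_sample=1.3):
    """Estimate Prometheus disk usage from retention, ingest rate, and sample size."""
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * ingested_samples_per_second * bytes_per_sample

print(needed_disk_space_bytes() / 1e9)  # ~16.8 GB for the example above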

To lower the rate of ingested samples, you can either reduce the number of time series you scrape (fewer targets or fewer series per target), or you can increase the scrape interval. However, reducing the number of series is likely more effective, due to compression of samples within a series.

More details on Prometheus storage can be found here.

Scale out

There are in fact various ways to scale and federate Prometheus. One architecture is to run multiple sharded Prometheus servers, each scraping a subset of the targets and aggregating within its shard. A leader then federates the aggregates produced by the shards and rolls them up to the job level.

For further information, there is an interesting read on scaling out here.

Grafana

Grafana's requirements are simple: it needs a minimum of 255 MB of RAM and a single core. You might need a little more RAM if your requirements include:
  • Server-side rendering of images
  • Alerting
  • Data source proxying

The bottleneck for Grafana performance is the time series database backend when running complex queries. By default, Grafana ships with SQLite, an embedded database stored in the Grafana installation location.