Tuesday, October 19, 2021

Data Migration - AI Use Case

Text Summarization During Data Migration - ML Use Case


When data needs to be migrated from one system to another, a common challenge the industry faces is dealing with data-length incompatibility between the source and target systems. This challenge surfaces when the target system cannot accept a larger data size for fear of impacting the existing setup and usage. As modifying the target is not an option, teams usually end up truncating the data when the source system's data size is larger than the target's. Truncation is an easy option, but it comes at the cost of losing precious information, especially when the data relates to Sales, Customers, or Financials.

With AI/ML advances in natural language processing, this challenge can be countered with various summarization techniques.

My Use Case for Summarization

A few data attributes in the source system had a data length of 500 characters, while the target system imposed a 255-character constraint on those attributes. The migration team had two options: either truncate data longer than 255 characters, or meaningfully reduce the text below 255 characters without losing context and information.

The text in those attributes consists mostly of dates, acronyms, short-form words, and links, apart from the usual nouns, verbs, adjectives, and other parts of speech; it is the kind of note users jot down for follow-ups or as reminders.

AI Summarization Techniques

There are many third-party services that provide text summarization. I explored IBM Watson, Lexalytics, MeaningCloud, etc. to check their accuracy. These services did not show the accuracy that we wanted.

Based on the output type, two ways of summarization are possible, along with a mix of the two.

Extractive Summarization: The top N sentences are extracted to provide a subset of the information.

Abstractive Summarization: Key points are generated in concise form without losing the context.

Mixed: An abstractive summary is generated after identifying an extractive intermediate state.

Design

Low accuracy with Extractive approach

For the use case mentioned above, the approaches and algorithms below did not return a concise summary without losing context. I noticed either context loss or character counts that were not reduced below 255. (A minimal sketch of one of these attempts follows the list.)

  • TextRank: The most frequently occurring words are ranked high
  • Sumy's LexRank Summarizer: A sentence similar to most other sentences is ranked high in this algorithm
  • Sumy's KL-Sum: Word distribution similarity is the basis for sentence selection
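
Here is a minimal sketch of how the Sumy summarizers were exercised, for reference. The input text and the sentence count are placeholders, and KLSummarizer can be swapped in for LexRankSummarizer to compare outputs.

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.kl import KLSummarizer

text = "..."  # placeholder: one source attribute value (up to 500 characters)
parser = PlaintextParser.from_string(text, Tokenizer("english"))

# Extract the top-N sentences; sentences_count=2 is an arbitrary choice for illustration.
for summarizer in (LexRankSummarizer(), KLSummarizer()):
    summary = summarizer(parser.document, sentences_count=2)
    print(" ".join(str(sentence) for sentence in summary))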

Higher accuracy with BERT - Abstractive approach 

Google's BERT (Bidirectional Encoder Representations from Transformers) is a deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus.
 

Technical Stack Used

PyTorch: Used to implement the BERT model

Sentence Transformers: Used to get sentence embeddings

Langdetect: Library to identify languages (used to keep English-only text)

Pandas: Data frames

Regex: Used for search patterns

NLTK: The Natural Language Toolkit is a popular text processing library used for tasks such as stemming, lemmatization, tokenization, and other statistical text processing methods

Dateparser: For parsing dates


Solution Overview

Preparing data for ML

The input text encountered was quite diverse and followed irregular patterns. This inconsistency posed a big challenge, and the data had to be processed before applying the ML model to it. Here is the list of such inconsistencies:
  • Different date formats
  • Unfinished sentences
  • Parentheses
  • Misspelled words
  • Use of non-homogeneous, domain-specific abbreviations
  • Chat patterns
  • Acronyms
  • Chronologically ordered data in some cases
  • Non-English language

Data Cleanup Process

  • Language detection is applied to keep only English-language text
  • Links in the text are reduced to tiny URLs
  • Parentheses and the text within them are removed
  • Date formats are converted to numeric form to reduce the character count (see the sketch after this list)
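
Here is a minimal sketch of these cleanup steps, assuming the langdetect and dateparser libraries and plain regex. The helper name, the numeric date format, and the date pattern are illustrative; URL shortening is done via an external service and is omitted here.

import re
import dateparser
from langdetect import detect

def clean_text(text):
    # Keep English-only records; non-English rows are filtered out.
    if detect(text) != "en":
        return None
    # Remove parentheses and the text within them.
    text = re.sub(r"\([^)]*\)", "", text)
    # Convert verbose dates such as "October 19, 2021" to a compact numeric form.
    for token in re.findall(r"[A-Z][a-z]+ \d{1,2},? \d{4}", text):
        parsed = dateparser.parse(token)
        if parsed:
            text = text.replace(token, parsed.strftime("%m/%d/%y"))
    return re.sub(r"\s+", " ", text).strip()
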
Using the NLTK library, the following processing is applied to the input text (a sketch follows this list):
  • Tokenization – A very important task in NLP where sentences are semantically broken down into words
  • Lemmatization – The process of transforming a word into its root form (Ex: going – go, completed – complete)
  • Punctuation removal – Punctuation marks in the text are removed
  • Stop word removal – Stop words such as of, such, during, should, etc. are removed from the text
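
Here is a minimal sketch of this NLTK pipeline, assuming the standard punkt, wordnet, and stopwords resources have been downloaded; the exact tokenizer and lemmatizer choices are assumptions.

import string
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def nltk_reduce(text):
    tokens = word_tokenize(text)                                   # tokenization
    tokens = [lemmatizer.lemmatize(t, pos="v") for t in tokens]    # lemmatization
    tokens = [t for t in tokens if t not in string.punctuation]    # punctuation removal
    tokens = [t for t in tokens if t.lower() not in stop_words]    # stop word removal
    return " ".join(tokens)
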
Using regex:
  • N-grams (phrases with N words) are mined from the dataset; common phrases are identified and replaced with a shorter alternate representation (see the sketch after the sample list below).
Sample N-grams list composed and mapped to shorter versions:

N_grams = {
    'sent a follow up email': 'mailed',
    'a follow up email to': 'mailed',
    'sent an email notification to': 'mailed',
    'to see if we can': 'maybe',
    'sent a closure email to': 'closure mailed',
    'internal process of approval final': 'approval',
    .....
}
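
Here is a minimal sketch of how such a mapping can be applied with regex; the whole-word boundaries, case-insensitive matching, and longest-phrase-first ordering are assumptions about the implementation.

import re

def replace_ngrams(text, n_grams):
    # Replace longer phrases first so overlapping keys do not clash.
    for phrase in sorted(n_grams, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        text = re.sub(pattern, n_grams[phrase], text, flags=re.IGNORECASE)
    return text

# Example: replace_ngrams("Sent a follow up email to the vendor", N_grams)
# -> "mailed to the vendor"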

ML Model

The parts-of-speech-based summarization reduced the character count to under 255 for ~80% of records. As deleting sentences is the last resort in this implementation, a word-reducing mechanism was applied within sentences using the BERT approach.

With BERT trained on a 2.5-billion-word corpus, it becomes easy to extend it without worrying about finding training data. This neural network model is fine-tuned for this use case by training with the text from the last stage. The training data consists of sentence pairs with a similarity score; the model trains on identifying which words in either of the sentences contribute to the similarity score. This learning effectively helps the model produce a set of candidates that are similar in context to a given input sentence.
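
Here is a minimal sketch of this kind of fine-tuning with the Sentence Transformers library, assuming sentence pairs from the cleaned-up text labeled with a similarity score; the base model name, the example pairs, the batch size, and the epoch count are placeholders.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("bert-base-nli-mean-tokens")  # placeholder base model

# Hypothetical sentence pairs with similarity labels in [0, 1].
train_examples = [
    InputExample(texts=["closure mailed to vendor 10/12/21", "sent a closure email to the vendor"], label=0.9),
    InputExample(texts=["approval pending with finance", "shipment delayed at port"], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)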

The dates in the text are removed before passing it to the model, as they are treated as essential; once the model produces its output, the dates are added back to the text. Before sentences are given to the model, they are first broken down into N-grams according to length-based rules. For example, if a sentence has a character count of less than 20, it is broken down into N-grams of 1 or 2 words (the N-gram range), and out of such N-grams only 1 (the Top N) can be picked to represent the sentence. Similarly, N-grams are made for each sentence, and the Top-N N-grams are picked for the final summary.
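
Here is a small sketch of breaking a sentence into word N-grams for a given range, using NLTK's ngrams helper; the range values would come from the length-based rules above.

from nltk.util import ngrams

def sentence_ngrams(sentence, n_lo=1, n_hi=2):
    # Build every contiguous phrase of n_lo..n_hi words as a candidate.
    words = sentence.split()
    return [" ".join(g) for n in range(n_lo, n_hi + 1) for g in ngrams(words, n)]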

All the N-grams are passed to the BERT model, which encodes each N-gram into a candidate embedding (the N-gram/phrase is represented as a vector). The original sentence is also encoded, which is called the document embedding. These embeddings are passed, along with the Top-N value and the diversity of the final summary (fixed at 0.5), to Maximal Marginal Relevance (MMR) [10]. MMR finds the top-N candidates that best represent the given sentence, and these candidates/phrases are added to the summary in place of that input sentence. The dates are added back to the sentence by comparing the BERT-summarized text with the original text.
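
Here is a minimal sketch of this candidate-selection step, assuming Sentence Transformers embeddings and scikit-learn's cosine similarity; the model name is a placeholder, and the MMR loop follows the standard relevance-minus-redundancy formulation.

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("bert-base-nli-mean-tokens")  # placeholder fine-tuned model

def mmr_select(sentence, candidates, top_n=1, diversity=0.5):
    doc_embedding = model.encode([sentence])            # document embedding
    candidate_embeddings = model.encode(candidates)     # candidate embeddings
    cand_doc_sim = cosine_similarity(candidate_embeddings, doc_embedding)
    cand_cand_sim = cosine_similarity(candidate_embeddings)

    # Start with the candidate most similar to the whole sentence.
    selected = [int(np.argmax(cand_doc_sim))]
    remaining = [i for i in range(len(candidates)) if i not in selected]
    for _ in range(top_n - 1):
        if not remaining:
            break
        relevance = cand_doc_sim[remaining, 0]
        redundancy = np.max(cand_cand_sim[np.ix_(remaining, selected)], axis=1)
        scores = (1 - diversity) * relevance - diversity * redundancy
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return [candidates[i] for i in selected]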

Result

With the BERT model, around 92% of records were reduced to fewer than 255 characters. For a test data size of 1,000 records:

Rows with character count < 255 - 919/997

Rows with character count 255-265 - 42/997

Rows with character count 265-275 - 9/997

Rows with character count 275-285 - 27/997