Text Summarization During Data Migration - ML Use Case
When data needs to be migrated from one system to another, a common challenge the industry faces is data-length incompatibility between the source and target systems. This challenge surfaces when the target system cannot accept a new data size for fear of impacting its existing setup and usage. As modifying the target is not an option, teams usually end up truncating the data when the source data is larger than the target allows. Truncation is the easy option, but it comes at the cost of losing precious information, especially when the data relates to Sales, Customers, or Financials.
With advances in AI/ML for natural language processing, this challenge can be countered with various summarization techniques.
My Use Case for Summarization
A few data attributes in the source system had a data length of 500 characters, while the target system imposed a 255-character constraint on those attributes. The migration team had two options: either truncate data longer than 255 characters, or meaningfully reduce the text below 255 characters without losing context and information.
The text in those attributes consists mostly of dates, acronyms, short-form words, and links, apart from the usual nouns, verbs, adjectives, and other parts of speech; users record these notes for follow-ups or as reminders.
AI Summarization Techniques
There are many third-party services that provide text summarization. We explored IBM Watson, Lexalytics, MeaningCloud, etc., to check their accuracy. These services did not show the accuracy that we wanted.
Based on output type, the following kinds of summarization are possible.
Extractive Summarization: The top N sentences are extracted to provide a subset of the information.
Abstractive Summarization: Key points are generated in concise form without losing the context.
Mixed: An abstractive summary produced after identifying an extractive intermediate state.
Design
Low accuracy with the Extractive approach
- TextRank: Sentences containing the most frequently occurring words are ranked high
- Sumy's LexRank Summarizer: A sentence similar to most other sentences is ranked high in this algorithm
- Sumy's KL Summarizer: Word-distribution similarity is the basis for sentence selection (a usage sketch follows this list)
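A minimal sketch of how the Sumy extractive summarizers can be exercised; the input text and sentence count here are illustrative, not from the actual dataset.

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.kl import KLSummarizer

text = ("Sent a follow up email to the vendor. "
        "Awaiting internal approval. "
        "Will close the ticket once confirmed.")
parser = PlaintextParser.from_string(text, Tokenizer("english"))

# Try each extractive summarizer and keep the single top-ranked sentence.
for summarizer_cls in (LexRankSummarizer, KLSummarizer):
    summary = summarizer_cls()(parser.document, sentences_count=1)
    print(summarizer_cls.__name__, ":", " ".join(str(s) for s in summary))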
Higher accuracy with BERT - Abstractive approach
Technical Stack Used
PyTorch: Used to implement the BERT model
Sentence Transformers: Used to get sentence embeddings
Langdetect: Library to identify languages (used here to detect English only)
Pandas: Data-frame library
Regex: Used for search-pattern matching
NLTK: Natural Language Toolkit, a popular text-processing library used for tasks such as stemming, lemmatization, tokenization, and other statistical text-processing methods
Dateparser: For parsing dates (the full stack as imports is shown below)
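For reference, the stack above as Python imports (PyPI package names: torch, sentence-transformers, langdetect, pandas, regex, nltk, dateparser):

import torch                                            # BERT model runtime
from sentence_transformers import SentenceTransformer   # sentence embeddings
from langdetect import detect                           # language identification
import pandas as pd                                     # data frames
import regex                                            # pattern search
import nltk                                             # tokenization/lemmatization
import dateparser                                       # date parsing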
Solution Overview
Preparing data for ML
- Different date formats
- Unfinished sentences
- Parentheses
- Misspelled words
- Non-homogeneous, domain-specific abbreviations
- Chat-style patterns
- Acronyms
- Chronologically ordered data in some cases
- Non-English language text
Data Cleanup Process
- Language detection is applied to filter for English-language text only
- Links in the text are reduced to tiny URLs
- Parentheses and the text within them are removed
- Date formats are converted to numeric form to reduce the character count
- Tokenization – a very important task in NLP where sentences are broken down semantically into words
- Lemmatization – the process of transforming a word to its root form (Ex: going – go, completed – complete)
- Punctuation removal – punctuation marks in the text are removed
- Stop-word removal – stop words such as of, such, during, should, etc., are removed from the text
- N-grams (phrases with N words) are mined from the dataset and common phrases are identified, which are replaced with an alternate representation (a pipeline sketch follows the mapping below):
N_grams = {
    'sent a follow up email': 'mailed',
    'a follow up email to': 'mailed',
    'sent an email notification to': 'mailed',
    'to see if we can': 'maybe',
    'sent a closure email to': 'closure mailed',
    'internal process of approval final': 'approval',
}
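Putting the steps above together, a minimal sketch of the cleanup pass, assuming English-only filtering, NLTK resources already downloaded (punkt, wordnet, stopwords), and the N_grams mapping defined above; this is illustrative, not the production pipeline:

import re
from dateparser.search import search_dates
from langdetect import detect
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean(text):
    if detect(text) != "en":                       # keep English-only records
        return ""
    text = re.sub(r"\([^)]*\)", "", text)          # drop parentheses and contents
    for match, dt in (search_dates(text) or []):   # shrink dates to numeric form
        text = text.replace(match, dt.strftime("%m/%d/%y"))
    for phrase, short in N_grams.items():          # replace common phrases
        text = text.replace(phrase, short)
    tokens = word_tokenize(text)                   # tokenization
    tokens = [lemmatizer.lemmatize(t) for t in tokens if t not in ",.;:!?"]
    return " ".join(t for t in tokens if t.lower() not in STOP_WORDS)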
ML Model
The parts-of-speech-based summarization is able to reduce the character count to under 255 for ~80% of records. As deleting sentences is the last resort in this implementation, we applied a word-reduction mechanism within sentences using a BERT-based approach.
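As an illustration of parts-of-speech-based reduction, a sketch using NLTK's tagger (assuming the averaged_perceptron_tagger resource is downloaded); the set of tags kept here is our assumption, not the exact production rule:

from nltk import pos_tag, word_tokenize

# Content-bearing tag prefixes: nouns, verbs, adjectives, numbers (assumed set).
KEEP_TAGS = ("NN", "VB", "JJ", "CD")

def pos_reduce(sentence):
    tagged = pos_tag(word_tokenize(sentence))
    return " ".join(word for word, tag in tagged
                    if tag.startswith(KEEP_TAGS))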
With BERT pre-trained on a 2.5-billion-word corpus, it becomes easy to extend without worrying about finding training data. This neural-network model is fine-tuned for this use case by training on the text from the last cleanup stage. The training data consists of sentence pairs with a similarity score; the model trains on identifying which words in either of the sentences contribute to the similarity score. This learning effectively helps the model produce a set of candidates that are similar in context to a given input sentence.
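A sketch of that fine-tuning setup with the sentence-transformers API; the base checkpoint, example pairs, and similarity scores below are placeholders for illustration:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("bert-base-nli-mean-tokens")  # assumed base model

# Sentence pairs with a similarity score in [0, 1] (illustrative values).
train_examples = [
    InputExample(texts=["sent a follow up email to vendor", "mailed vendor"], label=0.9),
    InputExample(texts=["pending internal process of approval", "approval"], label=0.8),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)  # learns from pair similarity

model.fit(train_objectives=[(train_loader, train_loss)],
          epochs=1, warmup_steps=10)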
The dates in the text are removed before passing it to the model, as they are treated as essential; once the model produces its output, the dates are added back to the text. Before being given to the model, the sentences in the text are first broken down into N-grams according to a length-based rule. If a sentence has a character count of less than 20 ("length of text less than"), it is broken down into N-grams of 1 or 2 words (the "N-gram range"), and out of those N-grams only 1 ("Top N") can be picked to represent the sentence. N-grams are made similarly for each sentence, and the Top-N N-grams are picked for the final summary.
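A sketch of that candidate generation using NLTK's ngrams helper; the rule for sentences of 20 or more characters is our assumption, since the text spells out only the under-20 case:

from nltk import word_tokenize
from nltk.util import ngrams

def candidate_ngrams(sentence):
    if len(sentence) < 20:
        ngram_range, top_n = (1, 2), 1   # the rule stated above
    else:
        ngram_range, top_n = (2, 4), 2   # assumed values for longer sentences
    words = word_tokenize(sentence)
    candidates = []
    for n in range(ngram_range[0], ngram_range[1] + 1):
        candidates += [" ".join(gram) for gram in ngrams(words, n)]
    return candidates, top_n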
All the N-grams produced are passed to the BERT model, which encodes each N-gram into a candidate embedding (the N-gram/phrase represented as a vector). The original sentence is also encoded; this is called the document embedding. These embeddings are passed, along with the Top-N value and the diversity of the final summary (fixed at 0.5), to Maximal Marginal Relevance (MMR) [10]. MMR finds the top-N candidates that best represent the given sentence, and these candidates/phrases are added to the summary in place of that input sentence. The dates are added back by aligning the BERT-summarized text with the original text.
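A compact sketch of the MMR selection over those embeddings, assuming the fine-tuned SentenceTransformer model from above; diversity is fixed at 0.5 as stated:

from sentence_transformers import util

def mmr_select(model, sentence, candidates, top_n, diversity=0.5):
    doc_emb = model.encode(sentence, convert_to_tensor=True)      # document embedding
    cand_embs = model.encode(candidates, convert_to_tensor=True)  # candidate embeddings
    sim_to_doc = util.cos_sim(cand_embs, doc_emb).squeeze(1)
    sim_between = util.cos_sim(cand_embs, cand_embs)
    picked = [int(sim_to_doc.argmax())]            # start with the best candidate
    while len(picked) < min(top_n, len(candidates)):
        rest = [i for i in range(len(candidates)) if i not in picked]
        # Trade off relevance to the sentence against redundancy with prior picks.
        scores = [(1 - diversity) * sim_to_doc[i].item()
                  - diversity * sim_between[i, picked].max().item()
                  for i in rest]
        picked.append(rest[max(range(len(rest)), key=lambda j: scores[j])])
    return [candidates[i] for i in picked]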
Result
With the BERT model, around 92% of records were reduced to fewer than 255 characters. For a test data set of 1,000 records:
Rows with character count < 255: 919/997