Saturday, December 16, 2023

What is the IP Geolocation database that caused Zerodha's outage

You have probably read the news about the outage in Zerodha's online stock trading platform that caused huge inconvenience to its users. In case you haven't, visit this page to learn more about it.

Zerodha apologised and clarified that a routine upgrade of its IP/Geolocation database caused the outage. Based on IP addresses, the platform notifies users of logins from new geographical locations, and the upgrade sent out an unexpectedly large number of such alerts. The incident note mentions that this led to a large influx of password reset requests from perplexed users, putting a heavy load on their login systems and in turn resulting in login failures. If you are wondering how this IP/Geolocation database works, read on.


What is IP geolocation?

IP geolocation is a technique that lets websites determine a user's location simply from the user's IP address. I assume you have a fair understanding of IP and computer networking, as I am not covering the basics here. The popularity of the mobile web and application paradigms introduced the need to contextualize services with the user's location. While smartphones gained this capability through modern cellular technology, the complexity of web standards and privacy concerns made the structured gathering of such information difficult.

There are many techniques to collect such information; however, the most widely used way to obtain positioning information without explicitly gathering GPS data is IP geolocation. This information comes in handy for online services, geofencing, fraud detection, and online advertising. An IP geolocation database provides a mapping from any IP address in the world to latitude and longitude coordinates. Despite questionable accuracy and precision, these databases are the standard approach for location-based services on the Internet.

There are many vendors, both free and commercial. Usually these vendors do not disclose the algorithms and techniques they use to map IP addresses to locations. From the generic information published on vendor websites, they appear to use a combination of active and passive measurements, possibly combined with machine learning.


Challenges

The IP geolocation database technique poses many challenges, and the increasing use of VPNs adds further complexity. An entry in an IP geolocation database may rely on observations from very few users in a given area, so it is difficult to achieve 100% accuracy. On top of that, IP addresses move between physical or virtual devices and between geographies, which makes it even harder to keep the database accurate. Geolocating mobile devices can be even more complicated, as they obtain new IP addresses as they change location.

These service providers usually offer two ways to access the data: you can either download the database itself or query it through an application programming interface (API).
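To make the database approach concrete, here is a minimal Python sketch assuming MaxMind's geoip2 library and a locally downloaded GeoLite2-City.mmdb file; the file path and sample IP address are illustrative assumptions, not a recommendation of any particular vendor.

# Minimal sketch: look up an IP address in a locally downloaded
# GeoLite2-City database using MaxMind's geoip2 library (pip install geoip2).
# The database path and the sample IP are illustrative assumptions.
import geoip2.database

def locate(ip_address, db_path="GeoLite2-City.mmdb"):
    with geoip2.database.Reader(db_path) as reader:
        record = reader.city(ip_address)
        return {
            "country": record.country.iso_code,
            "city": record.city.name,
            "latitude": record.location.latitude,
            "longitude": record.location.longitude,
        }

print(locate("8.8.8.8"))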



Alternatives


HTML5 geolocation, GPS, user registration data, and cookies are some of the alternatives to an IP geolocation database.


HTML5 geolocation primarily locates users through the browser they are using; many prominent browsers have this feature enabled, and it works on a per-session basis.


GPS is a popular form of location tracking on mobile devices; it is by far the most accurate form of geolocation, able to discern the user's exact location to within a few feet.


User registration data and cookies are both primarily associated with websites. Based on the user registration information supplied, content is personalised for the user on an individual basis. 


Cookies, on the other hand, store location information that users have provided elsewhere. Cookies can also capture user behaviour while users browse the web.


These alternatives come with their own limitations, which often makes IP-based geotargeting the most practical option. For example, GPS is firmly entrenched in the mobile space, while HTML5 geolocation and cookies depend on the browser.



Conclusion


I hope this brief article helped you gain a basic understanding of IP geolocation databases and why Zerodha's business needs them, since it constantly requires user location for localization, location-based offerings, promotions, and so on. I'm concluding with my concern about data protection, privacy, and the legal use of this information as such location data is accumulated and utilized everywhere.



Further read


https://www.abstractapi.com/guides/why-is-my-ip-geolocation-wrong


Saturday, September 2, 2023

AI Powered API Mocker - Powering Test Automation

Introduction

APIs (Application Programming Interfaces) are the most popular way to programmatically integrate different applications. Enterprise applications often need to integrate with various other applications to acquire enterprise data, and the most common integration mechanism is the REST API. This also makes it challenging to test an application that is connected to many other systems. Most of the time, automated tests only establish the connection, validate the contract, and check that the API exists as documented; the actual data is tested far less often in automation.


Automation Challenges


The main challenges in testing such applications are: 


  • Commercial and SaaS-based API providers meter API consumption, so consuming APIs for testing, especially in load testing scenarios, incurs high costs.
  • No control over external applications and their downtime.
  • No control over the network; network issues can block the connection.
  • External calls are time consuming.

Testing a system that integrates with other systems through APIs is cumbersome due to the challenges listed above, and such a system cannot be tested in isolation either. The lack of good integration testing can lead to serious failures in production.



Solution


This solution addresses the above challenges with a system that simulates the external API server and serves API requests with closely matching data. The data from this system is not reliable and cannot be used to verify business values. However, it helps in successfully testing connections, input validation, interface contract validation, response format checks, and response validation.

 

Benefits:

  • As this system is set up locally, it provides a reliable network.
  • Eliminates API metering costs.
  • Faster responses, with the option of deploying on a faster network.
  • Downtime can be controlled.

 

This simulator is built as a server that receives requests from the application that is subject to API integration testing.


Fig-1 : Simple Flow



The mock system is built as a server that intercepts the outgoing traffic from the application under API integration test.


Mock system components (a minimal sketch of the frontend server follows Fig-2):

  1. Frontend Server: Receives requests from the application and redirects them to the ML model serving component.
  2. Data Loader: Gathers API data from various sources to feed the model for training.
  3. Random Data Generator component.
  4. ML Model Trainer: Trains on API requests and responses to predict accurate responses. The training data consists of HTTP request parameters and HTTP response parameters, covering both valid and invalid request data.
  5. ML Model Serving component: Serves the outcome of the ML model and sends the output to the Mock Response Data Builder.
  6. Mock Response Data Builder: Builds HTTP response data in JSON format to send back to the application.
  7. JSON Extractor: Extracts data values from JSON and converts them to a CSV file.
  8. Data Store: Stores the ML model and input data.

Fig-2 : High Level Solution
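To make the Frontend Server idea concrete, here is a minimal Python sketch using only the standard library; the route handling and the call into a placeholder response builder are illustrative assumptions, not part of any specific product.

# Minimal sketch of the Frontend Server: accept an incoming API request and
# hand it to a placeholder mock response builder. All names are illustrative.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def build_mock_response(path, payload):
    # Stand-in for the ML-backed Mock Response Data Builder described above.
    return {"path": path, "status": "mocked", "echo": payload}

class MockAPIHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(build_mock_response(self.path, payload)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), MockAPIHandler).serve_forever()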

Mock Response Builder

This is the core part of the system, and below are the approaches adopted for data generation. I'm adding more detail here to show why this part is challenging.

API requests and responses carry various types of data: not just strings/text and CLOBs/large text, but also booleans, chars, numerals (int, long, double), alphanumerics, binary (for images and files), and so on. The API mocker should mock all of these data types in order to fulfil incoming requests the way the original server would. Every data type poses its own challenge for the mock system, and a simple random generator may not work well for all of them if the results need to resemble the original server's responses.

 
1. Random Data Generator: 
For most data types, such as boolean, numeric, chars, lists (repeaters), date, country, and city, a random generator is used. Randomness is a simple way to produce the content required in the response to an API request. Object identifiers (IDs) are also generated randomly, as sketched below.
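Here is a minimal sketch of that idea; the type names and the sample schema are illustrative assumptions.

# Minimal sketch of the random data generator for common field types.
# Type names and the sample schema are illustrative assumptions.
import random
import string
import uuid
from datetime import date, timedelta

def random_value(field_type):
    if field_type == "boolean":
        return random.choice([True, False])
    if field_type == "int":
        return random.randint(0, 10000)
    if field_type == "double":
        return round(random.uniform(0, 10000), 2)
    if field_type == "char":
        return random.choice(string.ascii_letters)
    if field_type == "date":
        return (date.today() - timedelta(days=random.randint(0, 365))).isoformat()
    if field_type == "id":
        return str(uuid.uuid4())
    return "".join(random.choices(string.ascii_letters + string.digits, k=8))

# Generate one mock record from a field-type schema.
schema = {"active": "boolean", "quantity": "int", "price": "double", "order_id": "id"}
print({name: random_value(t) for name, t in schema.items()})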

2. Machine Learning for Data Generation:
Machine learning plays a crucial role in this solution: response content generation based on the input request is powered by it. Reading through the solution overview below is necessary to understand the role of AI here.

Evaluated models: Two classification algorithms were tried in my short project. A set of labeled examples, each comprising a set of feature values and a corresponding class label, is used for training; the model learns a general rule from it to classify new examples presented later. The system is trained with data from external APIs, mainly text data; the scope does not cover image and file data types.

1. k-Nearest Neighbor (KNN)

2. Support Vector Machines (SVM)


SVM showed slightly better accuracy on the dataset I used for one small example. The dataset contained text retrieved from a mere 100 labeled API requests. I did not try extensive model tuning, as the main focus was on building the end-to-end API mocker rather than on the AI itself.
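For readers who want to see the shape of such a comparison, here is a minimal scikit-learn sketch; the toy request strings and labels are made up for illustration and are not the dataset used in the project.

# Minimal sketch: comparing KNN and SVM text classifiers with scikit-learn.
# The toy request strings and labels below are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

requests = [
    "GET /customers?id=42", "GET /customers?id=7",
    "POST /orders item=book", "POST /orders item=pen",
    "GET /inventory?sku=A1", "GET /inventory?sku=B2",
]
labels = ["customer", "customer", "order", "order", "inventory", "inventory"]

for name, model in [("KNN", KNeighborsClassifier(n_neighbors=1)),
                    ("SVM", SVC(kernel="linear"))]:
    pipeline = make_pipeline(TfidfVectorizer(), model)
    scores = cross_val_score(pipeline, requests, labels, cv=2)
    print(name, scores.mean())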


I'm planning to try an open-source pretrained (foundation) generative model in the future, as the trend shows more promising outcomes and removes the hassle of training.


Model Training

Below are the steps involved in model training (a pre-processing sketch follows Fig-3):

  1. Data Sourcing: Crawling the target system's APIs, Swagger-like API documentation, and lists of APIs and input parameters in CSV.
  2. Pre-processing and Data Preparation:
    1. Collected data is written to CSV, with a separate CSV file created for each API request method.
    2. Data is classified into input parameters and target parameters.
  3. Data types used: valid input values, invalid user input values, and boundary conditions.
  4. Training method:
    1. Training with actual request parameters and responses.
  5. HTTP methods supported: GET, POST, PUT and DELETE.
Fig-3 : Training the model
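A minimal pandas sketch of the pre-processing step above, splitting one per-API CSV into input and target (response) parameters; the file name, column prefix, and column layout are illustrative assumptions.

# Minimal sketch: split a per-API CSV into input parameters and
# target (response) parameters. Column naming is an illustrative assumption.
import pandas as pd

def load_training_frames(csv_path, target_prefix="response_"):
    frame = pd.read_csv(csv_path)
    target_cols = [c for c in frame.columns if c.startswith(target_prefix)]
    input_cols = [c for c in frame.columns if c not in target_cols]
    return frame[input_cols], frame[target_cols]

# Hypothetical usage for one API's CSV file:
# X, y = load_training_frames("get_customers.csv")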


Real time serving using model output

The configuration specifies the attributes that require data generation through the model. When a need for text generation is identified, the trained model is invoked. Data extracted from the request JSON goes through a validation phase; validation is done against the training data (from the Swagger API) or manually fed documentation.


Fig-4 : Real time model serving


Response JSON generation

  1. The JSON response is constructed from the output of the model above (a minimal sketch follows this list).
  2. The response is sent back in JSON format.
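A minimal sketch of these two steps; the response envelope and field names are illustrative assumptions, with the predicted_fields dict standing in for the model's output.

# Minimal sketch: build the mock JSON response from model output.
# The envelope and field names are illustrative assumptions.
import json
from datetime import datetime, timezone

def build_response_json(predicted_fields, status_code=200):
    envelope = {
        "status": status_code,
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "data": predicted_fields,
    }
    return json.dumps(envelope)

print(build_response_json({"customer_id": 42, "name": "sample", "active": True}))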

Most importantly, this application allows API calls to the original server in case the test needs to run against it rather than the mock server. A toggle is provided: adding an application's URL to a passthrough list bypasses the mock server and makes a direct connection to the target API server, as sketched below.
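A minimal sketch of that toggle; the passthrough list, URLs, and the forwarding call are illustrative assumptions.

# Minimal sketch of the bypass toggle: forward to the real server when its
# base URL is on the passthrough list, otherwise return mocked data.
import urllib.request

PASSTHROUGH_URLS = {"https://api.example.com"}  # hypothetical real API base URL

def handle(base_url, path):
    if base_url in PASSTHROUGH_URLS:
        with urllib.request.urlopen(base_url + path) as resp:  # call real server
            return resp.read()
    return b'{"status": "mocked"}'  # otherwise answer from the mock builder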



Conclusion


The effectiveness of API testing is greatly increased when tests use realistic data, representative of real-world production conditions. Generating tests from production data must be done with care due to the risk of exposing sensitive data. Without automation, creating real-world useful tests is difficult to achieve at scale because of the high labor cost of combing through mounds of data, determining what is relevant, and cleansing the data of sensitive values.


This system is not meant for accurate data testing and should never be used when the accuracy and precision of API responses have to be verified. It is mainly suitable for interface contract testing when testing against the original API server poses the challenges mentioned above.

Wednesday, January 25, 2023

Processing Partially Valid Data in Real-time Application Integration

Introduction 

Siloed applications across an enterprise need to be interconnected to leverage data produced in other trusted source applications. This helps different business units collaborate better, reuse data, analyze it, make informed decisions, and ultimately add value to their offerings. Synchronizing data across applications is a crucial exercise in enterprises, and teams spend significant effort building robust integrations. There are many integration mechanisms in the industry today for building systems that interact with one another robustly, yet there are still challenges to be addressed.

Data gets tagged as invalid for various reasons; it can happen at the source or in transit, through faulty processing or network issues. Such invalid data causes real-time data-dependent systems to fail to deliver data to customers in time. The turnaround time to fetch valid data in further attempts delays reports and analytics, which in turn slows down informed decision making. In this article I'm proposing a solution to mitigate this issue.


Widely Used Approach


Let us first understand the commonly used sync mechanism.
The main components of any data synchronization are:
Data: Data is a strategic asset to enterprises and comes in many forms. Structured and unstructured sit at the top of the classification hierarchy, with various other formats beneath them.
Source Application: A data source that is trusted across the enterprise.
Middleware: Links two or more separate applications. Provides the common connection, orchestration, data transformation, and mapping logic for integrating heterogeneous applications.
Target Application: The application that receives data from external sources and stores it.

The common approach in automated real-time integration suffers time delays when it encounters invalid data:
Structured data is validated attribute by attribute when it is migrated to a new application. Receiving applications are designed to reject the whole record even if a single attribute is invalid. The rejected record goes back to the source application, where it has to be inspected and corrected either automatically or manually. Manual intervention to correct the data consumes more time and is error prone. This delay can make reporting or analytics applications that do not even depend on the invalid attributes lose time unnecessarily.

Step by step process:
1. Initial data load: Middleware receives the data and finds a certain attribute empty
2. Handling invalid data: Middleware tries to find the data in other source systems if orchestration is enabled
3. Middleware processing: Middleware marks the record as invalid
4. Middleware rejects the data back to the source application
5. Data correction at source: The source application corrects the data and pushes it to the middleware. This could take a long time depending on the type of error; manual corrections often take days
6. Reconciled data in target system: Once corrected, data is pushed again to the middleware
7. Middleware validates the data and, if valid, pushes it to the target application
8. Target application validates the data and, if valid, stores it

Commonly used integration mechanism


Issues in Current Approaches


No integration is error free. Errors can arise from computational factors, network issues, heterogeneity between applications, and so on. They can be minimized with careful design but not eliminated, as external factors play a major role. Most integration systems build features to handle invalid data, since the absence of such features leads to data mismatches, inaccurate computation, false reporting, mistrust, and huge costs. However, the commonly followed approaches are not enough and still struggle with significant time delays while handling invalid data.

There are many challenges in achieving data sync across applications. Data is marked as invalid by receiving applications when it is incomplete, inconsistent, inaccurate, or in the wrong format.

Not every attribute in a rejected record is required by every reporting or analytics application; many kinds of reporting and computation can still run without certain attributes. The practice of rejecting the whole record is unnecessary and adds latency to real-time data synchronization, which in turn negatively impacts data-dependent business in various ways.

My Solution Proposal 

If you agree with the issues stated above, read on for my proposal. My solution involves design changes in the middleware and target applications so that a record is accepted even when some of its attributes are invalid. The record is stored in its present state, while the error is reported to the source application with an appropriate error message. The missing or badly formatted data is imputed by the middleware and sent to the target application so that the record can be stored. Valid attributes of the same record can still be used in further processing, computation, or display to users; only the erroneous attribute is flagged. The flag indicates that the particular attribute cannot be used until it is corrected by the source application. Once the new attribute value from the source application is validated, the flag is removed. Allowing this partial dataset to reside and be processed keeps valid data available for computation without unnecessary delay.
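A minimal Python sketch of that middleware behaviour; the validators, the history-based imputation, and the flag field name are illustrative assumptions, not a reference implementation.

# Minimal sketch of the proposed middleware step: keep the record, impute
# invalid attributes, and flag them instead of rejecting the whole record.
# Validators, imputation source, and the flag field name are illustrative.
def process_record(record, validators, history):
    invalid = []
    for attr, is_valid in validators.items():
        if not is_valid(record.get(attr)):
            invalid.append(attr)
            record[attr] = history.get(attr)  # impute so processing can continue
    record["invalid_attributes"] = ",".join(invalid)  # comma-separated attribute names
    return record

validators = {"amount": lambda v: isinstance(v, (int, float)),
              "currency": lambda v: v in {"INR", "USD", "EUR"}}
history = {"amount": 100.0, "currency": "INR"}
print(process_record({"amount": None, "currency": "INR"}, validators, history))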



High level component view

Invalid data flagging design

The flagging design holds the invalid data with indicators on it:
1. Flag the invalid record to indicate which attribute failed the validation rules
2. Format: the name of the attribute; in case of multiple attributes, the attribute names are listed as comma-separated values
3. Flag the record to indicate which reports or processes can still run with the invalid record
4. Reporting or processing logic should use the flag to check whether a required attribute is flagged as incorrect (a minimal sketch follows this list)
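A minimal sketch of point 4; the record layout mirrors the middleware sketch above and is an illustrative assumption.

# Minimal sketch: reporting logic checks the flag before using an attribute.
def is_usable(record, attribute):
    flagged = record.get("invalid_attributes", "")
    return attribute not in [name for name in flagged.split(",") if name]

record = {"amount": 100.0, "currency": "INR", "invalid_attributes": "amount"}
print(is_usable(record, "currency"))  # True: safe to use in reports
print(is_usable(record, "amount"))    # False: wait for the corrected value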

Design implementation

Scenario 1: With empty value
1. Initial data load: Middleware receives the data and finds an attribute empty
2. Handling invalid data: Middleware tries to find the data from other source applications if orchestration is enabled
3. Middleware processing: If a value for the attribute is not found, the middleware imputes the data by looking at historical data or predicting it using AI
4. Middleware processing: Middleware flags this record as invalid marking the attribute
5. Middleware pushes the data to target application
6. Target system processing : Target application accepts the data along with flag
7. Record stored in target application: Target application stores record along with flagging the invalid attribute. This flagging helps the data consuming component not to use attribute with invalid value in computation or reporting.
8. Reconciliation by middleware: In parallel, the middleware sends the invalid record back to the source application asking for correction of the invalid attribute
9. Data correction at source: Source application corrects the data and pushes it to middleware
10. Reconciled data in target application: This corrected data without flag is pushed to the target application
11. Data correction at target application: Target application validates the data and removes the flag for attribute


Scenario 2: With wrong format or inaccurate value
1. Initial data load: Middleware receives the data and finds that an attribute has the wrong data format
2. Handling invalid data: Middleware flags this record as invalid marking the attribute
3. Middleware processing: Middleware pushes the data to target application
4. Target application processing : Target application accepts the data along with flag
5. Record stored in target application: Target application stores record along with flagging the invalid attribute. This flagging helps the data consuming component not to use attribute with invalid value in computation or reporting.
6. Reconciliation by middleware: Middleware sends the invalid record back to the source application asking for correction of the invalid attribute. It also tries, in parallel, to get the data from other data sources.
7. Data correction at source: Source application corrects the data and pushes it to middleware
8. Reconciled data in target application: This corrected data without flag is pushed to the target application
9. Data correction at target application: Target application validates the data and removes the flag for attribute. This allows the reporting and analytics functionality to utilize the attribute for further processing.  

Process Flow


Data Format


Conclusion

The following types of applications can benefit from the proposed design:
1. Real-time integration of applications
2. Integration of transactional systems where each data event needs to be captured and propagated to other systems without delay
3. Structured data migration to other applications
4. Real-time reporting and analytics applications that depend on external applications for data
5. Applications interacting in a hub-and-spoke design pattern

Friday, January 20, 2023

Should chaos engineering practice be made part of the Devops CI/CD pipeline?


Answering this question with just a yes or no would be tough. At first glance it looks straightforward: there are tools like Gremlin, Litmus or Chaos Monkey to simulate different types of failures, and you can integrate them into your pipeline so that they run automatically as part of your testing and deployment process. However, it's not that simple. Let us dig deeper into the challenges of including these fault injection experiments in a CD pipeline.
We need to start with the basics of chaos engineering and the purpose of Devops to arrive at a detailed answer. We must start with the official definition:
Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

Next we need to look at the principles. The principles of chaos engineering are well defined to help practitioners understand the discipline in detail. They include:

  • Good understanding of the system
  • Including stakeholders
  • Formulating hypothesis
  • Preparing experiments
  • Planning for game day
  • Running experiments
  • Monitoring
  • Collecting metrics
  • Increasing blast radius 

Automating fault injection through a Devops pipeline would miss some of the principles stated above. Monitoring the system's behaviour on a scheduled day, running these experiments together with all the stakeholders, is critical to the success of this practice. The learning that comes from live monitoring and analysis alongside other stakeholders provides a better perspective on the issues that are uncovered and helps in detecting the weak points of the system.


Simple fault injections whose results are easily guessable without the need for monitoring may be better candidates for a CD pipeline, but there are not many such simple faults. Even the simplest injected fault can manifest beyond imagination and become a complex issue within the ecosystem. Monitoring the system at every layer, with every possible means, is important while faults are being injected. Today's distributed systems are complex, with multiple layers and interdependent modules, and the same yardstick cannot be used at every stage. That said, a simple, bounded experiment like the one sketched below is the kind of candidate that could fit a pipeline.
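For illustration only, here is a minimal Python sketch of such a bounded experiment: briefly burn CPU on the host and check that a health endpoint still answers. The health URL, duration, and timeout are illustrative assumptions, and this is in no way a substitute for a full game-day experiment.

# Minimal sketch of a "simple" fault injection suitable for a pipeline step:
# create CPU pressure for a few seconds and verify a health endpoint still
# responds. The health URL, duration, and timeout are illustrative assumptions.
import multiprocessing
import time
import urllib.request

def burn_cpu(seconds):
    end = time.time() + seconds
    while time.time() < end:
        pass  # busy loop to create CPU pressure

def health_ok(url, timeout=2.0):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

if __name__ == "__main__":
    workers = [multiprocessing.Process(target=burn_cpu, args=(10,))
               for _ in range(multiprocessing.cpu_count())]
    for w in workers:
        w.start()
    ok = health_ok("http://localhost:8080/health")  # hypothetical endpoint
    for w in workers:
        w.join()
    print("service stayed healthy under CPU pressure:", ok)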


Game day planning includes activities that cannot be automated easily. The fire drill concept, team gathering, and war room setup are activities that every chaos engineering practitioner should experience; these rehearsals build the confidence to deal with real-life issues. Loose, token automation would defeat the preparedness objective. Real-life chaotic surprises require strong collaboration skills to bounce back quickly and prove resiliency; mere dependency on a rigid CD process, without room for continual evolution, would not yield better results.


After every iteration of the fault injection exercise, teams must plan to increase the blast radius and learn more. Automation with the same input criteria and parameters for every run would defeat the objective of continuous improvement. Increasing a time variable on each run is not hard to automate, but it requires complex logic and becomes unsustainable sooner than you think. Devops CD pipelines are not designed to handle such highly variable, time-consuming processes that demand human intervention.


Time is a crucial parameter, and in chaos engineering every minute of variation can increase the chances of uncovering a new issue. The main motto of automation is to save time; chaos engineering, however, uses time as a varying parameter to alter the characteristics of the fault injection and push the system to fail, with time as the catalyst. Also, running time-consuming processes in a Devops CD pipeline is not recommended: it slows down deployment and breaks modularity by increasing dependencies.


Conclusion


The definition and principles above establish that this practice is detailed experimentation, not mere testing that can simply be automated. Chaos engineering is about uncovering the inherent chaos in the system using a unique approach in every trial. Without variable parameters, human collaboration, and real-time monitoring, the process may not serve the real purpose of chaos engineering.

Sunday, November 6, 2022

Chaos Engineering Clarity


My recent conversations with the technical community made me feel that Chaos Engineering (CE) concepts are being misunderstood. As this is an emerging practice in the industry, deeper understanding will eventually clear up the confusion. Comprehensive documentation about this practice already defines its objectives, principles, and implementation, and it is widely available through the various open source and commercial CE tools on the market.

Technical folks new to this practice still have doubts and confusion about it, and I believe further clarity on CE needs to be provided widely. I have gathered many questions through my interactions with technical teams, and I'm trying to address some of them in this blog post.

Before I begin, you should read the definition of chaos engineering. As per Wikipedia:

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production


Is chaos engineering practice against compliance and regulation? 

No! In fact, CE experiments help in meeting security and compliance standards by uncovering hidden issues. Compliance regimes are generally set up to ensure data privacy, data protection, security, country- or region-specific laws, and adherence to standard processes. 
The objective of chaos engineering is to gain a deeper understanding of IT systems and to eliminate or minimize application and service outages. The fault injection tests are supposed to be done in a controlled manner, with a rollback plan as part of them. 
As for tool usage: just like any third-party tool used in applications, CE tools are also required to be compliant with an organization's security and licensing standards. The scans that enterprises mandate usually check for all the CVEs in third-party tools.


Is chaos engineering replacement for security related tests like static and dynamic/penetration test practice?

The general CE practice is not meant to replace security tests such as static, dynamic, or interactive application security testing (SAST, DAST or IAST). Both security testing and CE involve injecting faults into the system; the method of analysing the system by trying to break it is the same, but the objectives differ.

I do see attempts to extend CE practice to cover certain types of security testing; there is even a new name, Security Chaos Engineering, coined for this purpose. It uses the same fault injection approaches to inject security flaws into the system. However, one should not confuse general CE practice with security testing: their objectives are different even when the same tool is used for both.


Is chaos engineering suitable only for cloud infrastructure?

Any type of infrastructure that hosts your applications can be considered, whether it is cloud, containerized, virtual machine, or physical server infrastructure. The examples widely available in CE documentation lean toward cloud infrastructure, but that does not mean the practice applies only to the cloud.


Can it be made part of automation through CI pipelines?

Automating the injection of faults at various layers, such as application, network, and computing resources, is quite possible. However, the CE process requires observability through instrumentation, which is better carried out manually together with all stakeholders. Automating the system-breaking process could defeat the objective of gaining a deeper understanding of the system and its behaviour when faults are deliberately induced.

A crucial phase of CE practice is the game day (mock drill), where faults are injected and the system's behavior is captured. This helps teams be prepared when a real situation arises in the future.


How is chaos engineering different from load testing or performance testing?

The objective of load testing is to determine system performance under load and benchmark it. This helps teams understand system capacity and design the system accordingly; the scalability requirements of the system are identified and fulfilled with the help of performance testing.

This is entirely different from the objectives and approaches of CE practice. However, CE can include certain load testing tools to further probe the behavior of the system under load while a fault is injected. Apart from that, there is no other connection between CE and performance testing; they remain different in their objectives and approaches.


Is it necessary to use chaos engineering tool?

Tools help you get started quickly on achieving CE objectives. There are various open source and commercial tools available in the market that make ready-made fault injections available with easy setup. Most of them come packaged with resource exhaustion tests (RAM, disk, I/O), network tests (latency), infrastructure tests (pods, containers, VMs, servers), and application-level tests.

Writing these tests from scratch requires a lot of time and effort. I would recommend using one of these tools if you are new to CE practice; once you are proficient, you can start building your own.


I hope this post helps you. Thanks for reading, and I look forward to your feedback.


References

http://principlesofchaos.org/

https://arxiv.org/abs/2006.04444

https://www.ibm.com/cloud/architecture/architecture/practices/chaos-engineering-principles/


     

 



Friday, September 23, 2022

Software Architecture in DevOps Age


I started my architecture journey designing and solutioning a monolithic system. It was the era of layered software architecture, where the system was separated into a data layer, business layer, application layer, and presentation layer. Lengthy design phases were the norm back then. As an architect, I was mainly involved in collecting high-level requirements, creating the governance model, reviewing enterprise standards, documenting, communicating through architecture diagrams, identifying design patterns, and designing components for the software development process. Development only started after all that elaborate design work; there was a strong belief that architecture and design must completely end before implementation could start. The high coupling resulting from this model created dependencies between teams and eventually slowed down deliveries.

During 2013-14, when our team was introduced to Agile, we had concerns and questions. As an architect, my apprehension was whether traditional architecture fit into the Agile space. The concerns I had were:




  • How do we refactor if later sprints change the architecture?
  • Will short-term planning introduce major structural changes in the future? How do we handle them?
  • How can an architect effectively communicate with a larger number of small Agile teams?
  • How do we get long-term visibility into the client's requirements roadmap?

Service Oriented Architecture
As our product was quite stable and there were not many architectural changes until the end of 2014, the newly formed Agile teams faced few challenges like architecture refactoring. This transition period helped me work on my fears. During this time I was introduced to domain-driven design, which helped me decompose the layered system further into services. This was my first step toward working on multiple functional aspects in parallel, each falling into a different logical category, and it helped me shift from large programs toward multiple autonomous teams. 

DevOps Journey
In 2017 I got an opportunity to architect a large-scale solution from scratch. I started this work with a group of architects and engineers who were ready to adapt to the Devops model. 
The architecture practices I explored with the Devops model were: 
1. Priorities: Planning scope was primarily focused on our highest priorities based on assessed business capabilities. This helped us start with a simple solution and refine it incrementally and iteratively; the essence was to do just enough architecture to get through the next sprint. The architecture design iterates based on feedback from the organization's planning ecosystem and on new information and changes that occur in the organization's environment while planning and architecture are in progress.
2. Automation Everywhere: Automating everything from infrastructure provisioning to running test cases helped remove error-prone manual effort.
3. User-Oriented Design: Focusing on the user journey helped in collecting the non-functional requirements. 
4. Collaboration: As an architect, I was part of every squad, owning their delivery commitments. 
5. Architecture Review Board: The board, comprising architects and product owners as committee members, reviewed every new feature and epic to provide 360-degree feedback and perform impact analysis. 


Paradigm Shift in Design Process 

Moving from old-style complete up-front design to priority-based minimum viable architecture brought various changes to the design process. Challenges such as lack of roadmap clarity, unknown usage requirements, changing requirements, cost, and the system's evolvability introduced risks into the design, calling for a design that cushions all of them. The design strategies that helped are: 

  • Separation of concerns: Separating a software system into distinct solutions, such that each section addresses a separate concern
  • Modularization: Decomposing a system into modules driven by information hiding and separation of concerns
  • Loosely coupled interfaces: Interaction of the systems were based on open standards like API to reduce the interdependence between the systems.
  • Event driven: Real time data flow between the loose coupled systems
  • Distributed Systems: Taking full advantage of modern multi core processor technologies, systems are distributed to run concurrently to support horizontal scaling and elasticity to varying workloads.  
  • Non-functional requirements: Considering the important non-functional requirements is key to designing a system for the long term with minimal core changes. 
  • Meta-modeling: Modeling the concepts and relationships of a modeling language/notation
  • Augmented Intelligence: Rule engines to lower the cost of changing the behavior of the system



Technology to adapt Devops effectively
It would have been a harsh Devops journey without the support of a great set of modern technologies and tools. Some of the tools that immensely helped me are:
  • Cloud Native Stacks
  • Containerization
  • Test Automation Tool
  • Pipeline Management Tool
  • Code Scanning Tools
  • Deployment Automation Tool

Out of control areas

There are numerous aspects that are not under the direct control of architects. Changing business dynamics, shifts in customer interest, disruptive technologies, and so on can happen anytime, and the architecture should be able to absorb them with minimal refactoring effort. A few areas where I experienced design refactoring are: 

  • Core feature replacements or new additions
  • Obsolete technologies replacement
  • The sunset of an external system that we depended on for data


Summary

Just as everyone on a Devops team works across the entire application lifecycle, from development to deployment, architects also play a key role in every aspect of the software lifecycle in a Devops culture. As an architect, by managing change and complexity, I ensured that the Devops objective of delivering the end product quickly and efficiently was met. As I mature with Devops, my focus is less on the tools, automation, and orchestration, and more on communication, collaboration, and a collective effort to remove bottlenecks.



Cool DevOps industry leaders I follow
@danielbryantuk
@JayneGroll

Sunday, July 31, 2022

The Role of Web Application Firewall(WAF) in Security


“A web application firewall (WAF) is a specific form of application firewall that filters, monitors, and blocks HTTP traffic to and from a web service.” -Wikipedia


According to the PCI DSS Information Supplement for requirement 6.6, a WAF is defined as “a security policy enforcement point positioned between a web application and the client endpoint.”



A WAF is an application-level firewall commonly used to protect web applications. It sits in front of web applications to monitor HTTP traffic coming from the Internet, detecting and blocking malicious requests in real time. It forms the first line of defence for the web environment of users or companies. 



Types of WAFs


WAF functionality can be implemented in software or hardware, running in an appliance device, or in a typical server running a common operating system. It may be a stand-alone device or integrated into other network components.


Hardware WAF

This type of WAF comes as part of a hardware appliance deployed in the local network where the main web servers run. The device has its own computing resources and is suitable for websites that handle heavy traffic.


Software WAF

A software WAF is normally installed, set up, and maintained in a virtual machine. It is much cheaper and more flexible than a hardware WAF, but its throughput can be lower. 


SaaS WAF

This type is managed by a cloud service provider, so there is no maintenance overhead: optimising, patching, and managing are all handled by the provider. Ease of use and lower cost are its advantages. 




Core Capabilities of WAF

These are must-have features that most WAFs support; some commercial ones offer many more advanced features.


Reverse proxy for intercepting the incoming traffic

This is the most crucial feature that every WAF must support. Every incoming request to the server is first intercepted by the WAF, which works exactly like a reverse proxy.


Rule based logic, Parsing and signatures

Rules, or policies, specify what the WAF needs to look out for: specific patterns in the incoming web traffic stream. They also include the blocking action to take when an attack attempt is detected.
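A minimal Python sketch of this rule-based matching; the two signatures are deliberately simplified illustrations, not production-grade WAF rules.

# Minimal sketch of rule-based matching: each rule pairs a regex signature
# with a blocking action. The signatures are simplified illustrations only.
import re

RULES = [
    ("sql_injection", re.compile(r"(?i)\bunion\b.+\bselect\b"), "block"),
    ("xss", re.compile(r"(?i)<script\b"), "block"),
]

def inspect(request_payload):
    for name, signature, action in RULES:
        if signature.search(request_payload):
            return action, name
    return "allow", None

print(inspect("q=books' UNION SELECT password FROM users--"))  # ('block', 'sql_injection')
print(inspect("q=harmless search term"))                       # ('allow', None)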


Protection against OWASP top 10 security flaws 

At a minimum, WAFs must detect the attacks in the OWASP Top 10, the standard awareness document for developers and web application security. It represents a broad consensus about the most critical security risks to web applications and lists the top ten web application security flaws.


[Picture courtesy:OWASP]

Configurable for covering new attacks

A WAF should be customizable to detect new types of attacks: users should be able to modify the rules with simple configuration on demand.


Blocklists and Allowlists

This feature supports both positive and negative security models against known attacks.


Logs for data analysis

Logs help users debug and analyze the data stream.




Advanced Features

There are many advanced features being offered by commercial WAFs to add value to their offerings. 


DDOS protection 

Protection against denial of service attacks


UI Console

An intuitive dashboard for viewing stats and other reports; it can also be used for quick data analysis. 


Threat intelligence

AI-based machine learning to detect suspicious activity. Detects the latest hacker attack strategies by identifying hacking patterns.


Failover protection

Because a WAF can become a bottleneck and a single point of failure in the whole ecosystem, this feature ensures high availability: if an instance crashes, a new WAF instance is rolled out. 


High HTTP throughput

Distributing WAFs for faster assessment of a wide variety of risks helps maintain good throughput. 


Sensitive data protection 

This feature alerts on responses containing sensitive data


Plugin to existing web servers

Certain web servers allow extensions that let users expand their capabilities. Providing the WAF as a plugin to such servers makes it uniform and easy to configure. 


Brute force attack prevention 

Brute force protection guards against automated tools that run successive attacks to gain control.
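One common building block for this protection is a sliding-window attempt counter; the window size and threshold below are illustrative assumptions.

# Minimal sketch: block a client IP after too many attempts within a time
# window. The window and threshold values are illustrative assumptions.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_ATTEMPTS = 5
attempts = defaultdict(deque)

def allow_attempt(client_ip):
    now = time.time()
    window = attempts[client_ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()  # drop attempts that fell out of the window
    if len(window) >= MAX_ATTEMPTS:
        return False  # too many attempts: block further tries
    window.append(now)
    return True

for _ in range(7):
    print(allow_attempt("203.0.113.9"))  # last two prints are False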


Attack analysis

Helping users analyze attacks adds high value to a WAF offering.


Continuous upgrades

WAFs must be continuously upgraded to tackle new attack types. Thousands of new attacks are detected every year; more than 3,000 new vulnerabilities were discovered in 2021 alone. 



Is WAF Silver Bullet?

WAFs can only detect attacks at the HTTP layer, not at other layers. At the network layer, for example, separate network firewalls and intrusion prevention systems (IPS) are needed. 


In spite of the numerous features, enablers, and detection techniques, there are various tools and techniques used to bypass WAFs today. Some of the known approaches hackers use are browser emulation, obfuscation, encodings, and payload character modification. As WAF rules and policies are configured mainly with regular expressions, hackers find innovative ways to bypass them by modifying payloads. Automated tools help attackers speed up the process and find the vulnerable areas in WAFs.



Conclusion

A WAF is not a silver bullet, and hackers continuously find new ways to break its protection. One can't relax just by introducing a WAF into the infrastructure; the protection process never ends, with hackers finding new ways to break in every day. It requires continuous effort to stay updated on the latest security vulnerabilities and to upgrade the system accordingly.