What is Chaos Engineering?


Definition: “Chaos Testing”

Not every challenge that software and distributed systems face can be predicted. With Chaos Engineering, however, failures can be provoked and tested deliberately. The result is a system that stays more reliable in an emergency.


A number of chaos engineering tools are based on the image, described by IT specialist Antonio Garcia Martinez, of monkeys using software and provoking errors.

In software development, great attention is paid to building reliable, secure and robust systems. This goal is usually summed up in the term “resilience”, which also covers elasticity and robustness. With unit and integration testing, developers work on the reliability of their software, but in real scenarios involving distributed systems, these methods reach their limits.

Modern systems have so many components and interdependencies that conventional development methods can hardly cover them. Chaos Engineering takes a different approach: it deliberately tries to break the system and cause errors in order to determine how the system reacts in unexpected situations.

Chaos Engineering: digital chaos theory with a prominent patron

The technique of chaos engineering has been driven forward in recent years mainly by the US streaming service Netflix. Even if Netflix is not associated with digital progress as readily as Apple or Microsoft, the company runs a gigantic digital infrastructure and takes care that it works reliably.

Chaos Engineering provides a development model in which unexpected scenarios are tested and software is pushed to its limits and beyond. Contemporary systems confront developers with a growing, barely manageable complexity that can cause unpredictable problems at any time.

This also reflects a change in thinking: away from a development model in which the absence of failures is assumed to be the norm, and towards the assumption that crashes are unavoidable. With Chaos Engineering, redundant systems can be built in a more targeted way, so that errors ideally no longer affect end customers.

A simple example illustrates this: a system has been built for a certain maximum number of calls per second. How does it react when that maximum is reached and exceeded? Which part of the software reacts in what way, and at which point? Not every tested scenario needs to reflect everyday operation; an exciting part of chaos engineering is developing hypothetical scenarios.
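The overload question can be made concrete with a small simulation. The following Python sketch is only an illustration under assumed conditions: the function name `simulate_overload`, the capacity of 100 calls per second and the queue limit of 30 waiting calls are hypothetical and do not come from any real system. It shows how a backlog builds up and calls are dropped once the offered load exceeds the maximum.

```python
def simulate_overload(arrivals_per_second, capacity, queue_limit):
    """Simulate a service second by second and record how it degrades.

    All parameters are illustrative assumptions:
    capacity    -- maximum number of calls the service completes per second
    queue_limit -- number of calls that may wait for the next second
                   before further calls are dropped
    """
    backlog = 0
    history = []
    for arrivals in arrivals_per_second:
        pending = backlog + arrivals             # waiting calls plus new calls
        served = min(pending, capacity)          # the service never exceeds its capacity
        waiting = pending - served               # calls that could not be served yet
        dropped = max(0, waiting - queue_limit)  # calls beyond the queue limit are lost
        backlog = waiting - dropped              # the rest is carried into the next second
        history.append(
            {"arrivals": arrivals, "served": served,
             "dropped": dropped, "backlog": backlog}
        )
    return history


if __name__ == "__main__":
    # Load below, at and above the assumed maximum of 100 calls per second.
    load = [80, 100, 150, 150, 50]
    for second, row in enumerate(simulate_overload(load, 100, 30), start=1):
        print(second, row)
```

In a real experiment, the same question would be asked of the production system or a faithful copy of it; the simulation only shows which quantities (served, dropped and waiting calls) are worth observing.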

How does Chaos Engineering work in practice?

Very roughly, chaos engineering consists of experimenting with the limits of a piece of software. First, a steady state is defined that describes how the system behaves when it is working normally. A control group, which continues to operate outside the test scenario, ensures that any observed change is actually caused by the chaotic scenario.

Problems are now introduced into the test scenario, for example server crashes, defective hard drives, outages or overload. Chaos Engineering is primarily about testing how these problems are handled. How can the limits of a system be extended? What redundancies are necessary to keep the existing service running? And how can known weaknesses be eliminated?
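The procedure described above (define a steady state, keep a control group, inject a problem, compare) can be sketched in a few lines of Python. Everything in this sketch is a hypothetical toy model: the `Cluster` class, the round-robin routing, the three instances and the threshold of 99 percent successful requests are assumptions made for illustration, not part of any real tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cluster:
    """Toy model of a service running on several instances (hypothetical)."""
    healthy: List[bool]
    failover: bool = True

    def handle(self, request_id: int) -> bool:
        # Requests are routed round-robin to one instance.
        first_choice = request_id % len(self.healthy)
        if self.healthy[first_choice]:
            return True
        if not self.failover:
            return False
        # With failover, the request is retried on any healthy instance.
        return any(self.healthy)


def success_rate(cluster: Cluster, requests: int = 1000) -> float:
    """Steady-state metric: share of successfully handled requests."""
    succeeded = sum(cluster.handle(i) for i in range(requests))
    return succeeded / requests


def run_experiment(failover: bool, steady_state_threshold: float = 0.99) -> dict:
    # The control group runs untouched; only the experiment group gets the fault.
    control = Cluster(healthy=[True, True, True], failover=failover)
    experiment = Cluster(healthy=[True, True, True], failover=failover)
    experiment.healthy[1] = False  # injected problem: one instance crashes

    experiment_rate = success_rate(experiment)
    return {
        "control": success_rate(control),
        "experiment": experiment_rate,
        "hypothesis_holds": experiment_rate >= steady_state_threshold,
    }


if __name__ == "__main__":
    print("with failover:   ", run_experiment(failover=True))
    print("without failover:", run_experiment(failover=False))
```

In this toy model the steady state survives the crash only when failover is enabled. Without it, roughly a third of the requests fail, which is exactly the kind of weakness such an experiment is meant to expose.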

In order to gain more confidence in a system, it is essential to contain errors and to preserve the steady state as far as possible. This is precisely why systems have to be subjected to critical tests and errors have to be caused deliberately.

Simian Army: a software example for Chaos Engineering

To carry out these tests in practice, Netflix relies on the Simian Army software. This tool suite simulates (in the words of IT technician Antonio Garcia Martinez in the book “Chaos Monkeys”) a situation in which monkeys use the software and cause various errors. The individual testing tools are named accordingly.

Chaos Gorilla, for example, disables an entire availability zone in the server infrastructure, and Chaos Kong even disables an entire region of Amazon’s AWS infrastructure. Byte Monkey tests sources of error in the code of JVM applications, and Latency Monkey injects communication delays such as those experienced during network outages.
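The idea behind a latency-injecting tool can be shown with a short Python sketch. This is not the Simian Army implementation and does not use its API; the decorator `inject_latency`, its parameters and the example function `fetch_profile` are hypothetical names chosen for illustration. The sketch delays a configurable share of calls so that the reaction of callers (timeouts, retries, fallbacks) can be observed.

```python
import functools
import random
import time


def inject_latency(delay_seconds, probability, rng=None, sleep=time.sleep):
    """Return a decorator that delays a share of calls (hypothetical helper).

    delay_seconds -- artificial delay added to an affected call
    probability   -- share of calls that are delayed, between 0.0 and 1.0
    rng, sleep    -- replaceable for tests; default to random.Random() and time.sleep
    """
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < probability:
                sleep(delay_seconds)  # simulate a slow network or dependency
            return func(*args, **kwargs)
        return wrapper

    return decorator


# Illustrative use: every second call to this made-up function is slowed down on average.
@inject_latency(delay_seconds=0.2, probability=0.5)
def fetch_profile(user_id):
    return {"id": user_id}
```

Passing the random number generator and the sleep function as parameters is a deliberate design choice in this sketch: it keeps the experiment reproducible and lets the injection be tested without actually waiting.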

Even this short selection shows how large the failures that occur in distributed systems can be. In addition to Simian Army, tools such as Simoorg (open source) or Monkey-Ops (implemented in Go) are also used.

Of course, the end customer notices nothing of the complexity that underlies even the smallest operations, nor should they. For companies, an outage always means a potential loss of revenue, and chaos engineering is one way to make outages less likely.

In practice, Chaos Engineering contributes to exactly that, resulting in more fault-tolerant software and more satisfied customers. For developers, it also offers an attractive way to test sometimes realistic, sometimes bizarre error scenarios that go beyond everyday software development. This blurs the line between software development and quality assurance in favor of more stability and resilience.

