Skip to main content

Chaos engineering

Chaos engineering is the process of testing a distributed computing system to ensure that the system can withstand unexpected disruptions in function. It is so named because it relies on concepts from chaos theory, which focuses on random and unpredictable behavior. The goal of chaos engineering is to continuously conduct controlled experiments that introduce random and unpredictable behavior in order to discover weaknesses in a system.

In computing, a distributed system is any grouping of computers that are linked over a network and share resources. Distributed systems can break when unexpected conditions or situations (such as an unintentional change from an intentional update) occur.  Large distributed systems have complex and unpredictable dependencies between components, which can it difficult to troubleshoot an error. This is where chaos engineering comes into play. Chaos engineering identifies "what if" scenarios that aim to trigger failures, so that the system owners can evaluate the performance and integrity of the software.

How chaos engineering works

Like stress testing, chaos engineering aims to improve a system's or network's design by discovering and correcting its weaknesses. The process is typically divided into several steps and starts with the establishment of a baseline. The testers must first identify how the system should operate under optimal conditions and specify what constitutes a normal working state. Next, they must consider one or more potential weaknesses and formulate a hypothesis about the effects of those weaknesses.

Chaos engineering can be performed for a program not yet launched, and much can still be learned; however, testing on real-world conditions yields the most accurate results. For this reason, chaos engineering is often performed on production systems, especially when it is too cumbersome or expensive to duplicate a large, distributed system environment just for testing purposes. Naturally, this also means that chaos engineering can be highly disruptive. Success with the chaos engineering paradigm demands close communication and coordination between IT staff and developers and across business units. Experiments are rarely run during peak time to avoid giving customers a negative experience.

Chaos engineering tools

Chaos engineering is a relatively new approach to software testing and software quality assurance (QA). Netflix was a notable pioneer of chaos engineering, among the first to formalize how to use it in production systems. Netflix designed and open sourced automation platforms for chaos tests, including Chaos Monkey, Chaos Gorilla and similarly named tools, collectively dubbed the Simian Army. 

Comments

Popular posts from this blog

Black swan

A  black swan event  is an incident that occurs randomly and unexpectedly and has wide-spread ramifications. The event is usually followed with reflection and a flawed rationalization that it was inevitable. The phrase illustrates the frailty of inductive reasoning and the danger of making sweeping generalizations from limited observations. The term came from the idea that if a man saw a thousand swans and they were all white, he might logically conclude that all swans are white. The flaw in his logic is that even when the premises are true, the conclusion can still be false. In other words, just because the man has never seen a black swan, it does not mean they do not exist. As Dutch explorers discovered in 1697, black swans are simply outliers -- rare birds, unknown to Europeans until Willem de Vlamingh and his crew visited Australia. Statistician Nassim Nicholas Taleb uses the phrase black swan as a metaphor for how humans deal with unpredictable events in his 2007...

A Graphics Processing Unit (GPU)

A graphics processing unit (GPU) is a computer chip that performs rapid mathematical calculations, primarily for the purpose of rendering images. A GPU may be found integrated with a central processing unit (CPU) on the same circuit, on a graphics card or in the motherboard of a personal computer or server. In the early days of computing, the CPU performed these calculations. As more graphics-intensive applications such as AutoCAD were developed; however, their demands put strain on the CPU and degraded performance. GPUs came about as a way to offload those tasks from CPUs, freeing up their processing power. NVIDIA, AMD, Intel and ARM are some of the major players in the GPU market. GPU vs. CPU A graphics processing unit is able to render images more quickly than a central processing unit because of its parallel processing architecture, which allows it to perform multiple calculations at the same time. A single CPU does not have this capability, although multi...

6G (sixth-generation wireless)

6G (sixth-generation wireless) is the successor to 5G cellular technology. 6G networks will be able to use higher frequencies than 5G networks and provide substantially higher capacity and much lower latency. One of the goals of the 6G Internet will be to support one micro-second latency communications, representing 1,000 times faster -- or 1/1000th the latency -- than one millisecond throughput. The 6G technology market is expected to facilitate large improvements in the areas of imaging, presence technology and location awareness. Working in conjunction with AI, the computational infrastructure of 6G will be able to autonomously determine the best location for computing to occur; this includes decisions about data storage, processing and sharing.  Advantages of 6G over 5G 6G is expected to support 1 terabyte per second (Tbps) speeds. This level of capacity and latency will be unprecedented and wi...