Testing software-intensive systems involves more decisions than the standard vocabulary suggests. The conventional taxonomy (unit, integration, system, acceptance) organizes tests along a single axis of scope. But each test also involves deciding why to run it (intent), where to run it (environment), and how to judge correctness (the oracle: the mechanism that decides whether an output is right). Separating these decisions can clarify what a test strategy covers and where the gaps are.

This article builds a more complete picture of what testing involves: the gap between structural and behavioral correctness; the oracle challenges that intensify in systems that learn from data or interface with the physical world; the range from ad hoc exploration to formal verification; and the reality that most teams work under constraints where the question is not whether to test more, but what to invest in.

Conventional frameworks

  • Unit → integration → system → acceptance. The “test levels,” ordered by scope. Myers (1979) introduced foundational ideas about test levels in The Art of Software Testing; the ISTQB syllabus later codified the four-level hierarchy as the primary organizing axis.
  • The V-model. Requirements map to acceptance tests, architecture to integration tests, detailed design to unit tests. Standards like ISO/IEC/IEEE 29119 define generic test processes compatible with this and other lifecycle models.
  • The test pyramid. Cohn (2009) proposed a cost heuristic: many unit tests, fewer service-level tests, and fewer still at the UI level, as described by Fowler (2012). Variants include the testing trophy (Dodds, emphasizing integration tests) and the testing diamond (wider in the middle, for service-oriented architectures).
  • Test-driven development. Beck (2002) formalized writing tests before implementation: red, green, refactor. TDD is a design discipline as much as a testing practice; it forces interface decisions before implementation. Its strength is in domains where the oracle is clear and the specification is stable enough to write a test against.

These frameworks capture real tradeoffs. They also compress distinct decisions into a single label. “Integration test” means one thing in embedded systems (testing interfaces between hardware and software components) and something different in web development (testing interactions between services). “Hardware-in-the-loop” has a precise meaning in automotive and aerospace (real controller hardware running real firmware, with a simulated plant model, in real time) but is used loosely elsewhere for any test involving physical hardware. This is not a failure of the vocabulary; these terms serve their communities well. But they fix one dimension and leave the rest implicit, and different communities fill in those gaps differently. Decomposing a test into its constituent decisions can help bridge this ambiguity.

A note on terminology. Adjacent disciplines use overlapping but distinct terms for related activities. Verification asks whether the system meets its specification; validation asks whether the specification meets the actual need. These are often paired as V&V (IEEE 1012, DO-178C). In modeling and simulation, particularly in defense contexts, this extends to VV&A (Verification, Validation, and Accreditation), where accreditation is the formal determination that a model is acceptable for a specific use. Evaluation is used broadly in machine learning for model assessment (often shortened to “evals”). Qualification and certification appear in hardware, pharmaceutical, and aerospace contexts for formal attestation against regulatory criteria. This article uses “testing” broadly to encompass the act of executing a system and judging its outputs, which overlaps with all of these but is not synonymous with any of them.

Dimensions of a test

A test involves several decisions that can, to a useful degree, be considered separately: Scope (what is being tested), Intent (why the test exists), Environment (where it runs), and Oracle (how correctness is judged). These four are not exhaustive or perfectly independent — they interact, and there may be others worth identifying (test input generation strategy, for instance, interacts with all four but is not reducible to any of them). But conflating any two can cause confusion.

Scope

What is the system-under-test?

  • Component (a single function, class, or module)
  • Subsystem (a pipeline, service, or bounded context)
  • System (the full deployed artifact)
  • System-of-systems (multiple interacting systems, including dependencies outside the team’s deployment authority)

Scope is about boundaries, not code size. What counts as a unit depends on where the architecture draws its interfaces. Meszaros (2007) formalized the taxonomy of test doubles (stubs, mocks, fakes, dummies, spies) in xUnit Test Patterns precisely because scope decisions require explicit management of what lies outside the boundary.

Intent

Why is this test being run?

  • Regression (did existing behavior break?)
  • Progression (does the new or changed behavior work as specified?)
  • Exploration (what does the system actually do in this scenario?)
  • Acceptance (does the system meet requirements?)

Regression and progression are complementary. A code change affects two surfaces: existing behaviors that might break, and new behaviors that must work. A progression test, if retained, becomes a regression guard once the behavior it verified is established. The distinction matters for test design. To illustrate: adding concurrency to a module calls for progression tests for the new thread-safety requirements and regression tests for existing sequential behavior. These tests have different oracle and environment requirements, since concurrency bugs are timing-dependent and may not manifest in-process. Rothermel and Harrold (1997) formalized safe regression test selection, showing that the problem has structure independent of the other dimensions.
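
As a sketch of how the two intents differ in practice, consider the concurrency example above. The Counter class and test names below are hypothetical, not drawn from any particular codebase; the point is that the same change calls for one test that guards old behavior and one that exercises the new requirement.

```python
# Hypothetical sketch: a counter module gaining thread-safety.
import threading

class Counter:
    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def increment(self):
        with self._lock:
            self._value += 1

    @property
    def value(self):
        return self._value

def test_sequential_behavior_unchanged():
    # Regression intent: existing single-threaded behavior must survive.
    c = Counter()
    for _ in range(100):
        c.increment()
    assert c.value == 100

def test_concurrent_increments():
    # Progression intent: the new thread-safety requirement. A passing run
    # is weak evidence; timing-dependent bugs may simply not manifest in
    # this in-process environment.
    c = Counter()

    def hammer():
        for _ in range(1000):
            c.increment()

    threads = [threading.Thread(target=hammer) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    assert c.value == 8000
```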

On “progression testing.” “Regression testing” is standard across ISTQB, IEEE, and ISO vocabularies. “Progression testing” is less universal — it originates from the TMap methodology (Sogeti), where it is formally defined as testing new or adapted functionality. ISTQB uses “confirmation testing” for a related but narrower concept (specifically re-testing a defect fix to verify the fix works, rather than testing new features generally). The term “progression” is used here to emphasize the symmetry between checking that old behavior survived and checking that new behavior works.

Exploration includes interactive, ad hoc testing — someone running the system in varied conditions to see what happens. It does not produce a persistent automated artifact, but it consumes budget and produces information. In early development, when interfaces are still changing, it is often the most appropriate form of testing.

In regulated domains (e.g., DO-178C for avionics, ISO 26262 for automotive), acceptance testing becomes certification testing: the regulatory framework prescribes specific evidence and coverage criteria, constraining the other dimensions (e.g., DO-178C Level A requires MC/DC structural coverage; IEC 61508 SIL 4 highly recommends it).

Environment

Where does the test execute?

  • In-process (same process as the test runner, with mocked or stubbed dependencies)
  • Simulated (against a model of the environment, such as physics, networks, or patients, at varying fidelity; sometimes called in-silico testing)
  • Hardware-in-the-loop (real hardware running real firmware, with simulated stimuli)
  • Field (real hardware, real world)

Environment determines what phenomena the test can observe. An in-process test cannot observe real-time timing faults. A simulation cannot reproduce all sensor characteristics. Each environment has a fidelity envelope: the set of phenomena that can, in principle, manifest during execution.

```mermaid
graph LR
    A["In-process"] -->|"+ model"| B["Simulated"]
    B -->|"+ real hardware"| C["HIL"]
    C -->|"+ real world"| D["Field"]
```

“Simulated” spans a range. In automotive and aerospace, this is formalized: model-in-the-loop (MIL — testing a model of the software before code generation), software-in-the-loop (SIL — compiled production code against a simulated environment), and processor-in-the-loop (PIL — real code on the target processor, simulated I/O). MIL cannot reveal bugs introduced by code generation; SIL cannot reveal bugs caused by the target processor’s timing. In domains where these distinctions are formalized (e.g., ISO 26262), they warrant separate treatment.

Much of practical test engineering involves shifting tests along the environment dimension. Recording HTTP interactions for replay, mocking database connections, or using containerized dependencies (e.g., Testcontainers) all convert what would be an integration test against real services into a faster, more deterministic in-process or simulated test. This gains repeatability and speed but can lose fidelity: the mock may drift from the real dependency over time, and bugs that arise from the interaction with the real service (e.g., latency, partial failures, version mismatches) become invisible. Contract testing is one response: verify that the mock and the real service agree on the interface, so that the fidelity loss is bounded. The tradeoff is general: moving a test to a cheaper environment generally narrows the fidelity envelope, and the question is whether the phenomena lost are ones that matter.
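
A minimal sketch of this shift, using Python's unittest.mock (the fetch_user function and its client argument are hypothetical):

```python
# Shifting a test toward a cheaper environment by substituting a test
# double for the real service.
from unittest.mock import Mock

def fetch_user(client, user_id):
    # Production code would pass in a real HTTP client here.
    resp = client.get(f"/users/{user_id}")
    if resp.status_code != 200:
        raise LookupError(user_id)
    return resp.json()["name"]

def test_fetch_user_in_process():
    # In-process environment: fast and deterministic, but latency, partial
    # failures, and schema drift in the real service are now outside the
    # fidelity envelope. Contract tests would bound that loss.
    client = Mock()
    client.get.return_value = Mock(status_code=200,
                                   json=lambda: {"name": "Ada"})
    assert fetch_user(client, 42) == "Ada"
```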

Chaos engineering pushes in the opposite direction — deliberately injecting faults (e.g., network partitions, service failures, resource exhaustion) into a production or production-like environment to test resilience. In the dimensional framework, this is exploration or progression intent at system scope, in a degraded field environment, with property oracles (“does the system degrade gracefully?”). It explicitly targets the phenomena that convenient testing cannot observe.

Oracle

How is correctness judged?

  • Exact (a specific expected output, known in advance)
  • Differential (compare against a reference implementation or a previous version)
  • Property (output satisfies a constraint, without specifying the exact value)
  • Statistical (output distribution satisfies criteria over many runs)
  • Human (a person evaluates the output)

Exact and property oracles differ in specificity: an exact oracle says “this is the answer” (e.g., output == 42); a property oracle says “any answer satisfying this constraint is acceptable” (e.g., output > 0 or is_sorted(output)). The boundary is not always sharp; a floating-point comparison with a tolerance (abs(output - expected) < epsilon) looks like an exact oracle but is technically a property oracle, since it accepts a range of values. Non-deterministic systems (e.g., concurrent programs, stochastic algorithms) push further: if running the same test twice can produce different results, exact oracles on full system output become unreliable, and property or statistical oracles are often more appropriate. Despite the blurred boundary, the distinction matters: exact and property oracles have different failure modes and different sensitivity to system changes.
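
A few of these oracle kinds, sketched in Python (the assertions are illustrative, not drawn from a real suite):

```python
import math

def test_exact_oracle():
    # Exact: one specific expected value, known in advance.
    assert 6 * 7 == 42

def test_property_oracle():
    # Property: any output satisfying the constraint is acceptable.
    xs = sorted([3, 1, 2])
    assert all(a <= b for a, b in zip(xs, xs[1:]))

def test_tolerance_oracle():
    # Looks exact, but accepts a range of values: technically a property.
    assert math.isclose(0.1 + 0.2, 0.3, abs_tol=1e-9)
```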

Differential oracles deserve attention. Running the same inputs through the current and previous version of a system and comparing outputs is a widely used technique (sometimes called golden-file testing, snapshot testing, or characterization testing in the legacy code context). Its interpretation depends on intent: for a pure refactoring (no intended behavioral change), any output difference is a regression. For a feature change, some differences are expected (new behavior working as intended) and others are not (existing behavior inadvertently broken). The differential oracle reports all differences without distinguishing between them; a human or further test must decide which are acceptable. This makes differential oracles powerful for regression detection but insufficient on their own for progression testing.
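
A minimal golden-file sketch (the pipeline function and file name are hypothetical):

```python
# Differential oracle: compare against recorded output, not against an
# independently specified expectation.
import json
from pathlib import Path

def pipeline(records):
    # Stand-in for the system under test.
    return [r.strip().lower() for r in records]

def test_against_golden_file():
    golden = Path("golden_output.json")
    output = pipeline([" Alpha", "Beta "])
    if not golden.exists():
        # First run: record current behavior as the reference snapshot.
        golden.write_text(json.dumps(output))
    # Any difference from the snapshot is flagged; whether it is an
    # intended change or a regression is a separate judgment.
    assert output == json.loads(golden.read_text())

test_against_golden_file()
```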

The oracle is often the most underspecified dimension. A test can verify that a function returns the correct type and structure without necessarily checking whether the content is correct. The gap between structural correctness (right types, right interfaces) and behavioral correctness (right outputs under the right conditions) is where many real failures live. A perception module can pass all its unit tests — correct interfaces, correct data formats, correct latency — and misclassify objects in deployment, because the tests verified structure rather than behavior. Barr et al. (2015) surveyed the oracle problem and found it pervasive across software testing.

Metamorphic testing (Chen et al., 1998) sidesteps the need for a complete oracle: instead of specifying the expected output for input $x$, specify a relation between outputs for related inputs (e.g., rotating an image should not change which objects are detected). This requires domain knowledge to identify useful metamorphic relations, but it is one of the few oracle strategies that scales to systems without deterministic expected outputs. Fuzzing takes a complementary approach: generate large volumes of inputs (often random or mutation-guided) and check for violations of broad properties like “does not crash” or “does not trigger undefined behavior.” Fuzzing is powerful for finding faults without specifying expected outputs, though its property oracles are typically coarse.
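
A minimal sketch of a metamorphic relation, using Python's built-in sort as the system under test (the relation and the crude fuzzing loop are illustrative):

```python
import random

def metamorphic_sort_check(xs):
    # Metamorphic relation: permuting the input must not change the
    # sorted output. No exact expected output is ever specified.
    shuffled = xs[:]
    random.shuffle(shuffled)
    assert sorted(xs) == sorted(shuffled)

# Fuzzing flavor: many generated inputs, a coarse oracle
# ("does not raise, relation holds").
for _ in range(1000):
    metamorphic_sort_check([random.randint(-100, 100) for _ in range(20)])
```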

Oracles for learned systems

Machine learning systems, from classical models to large-scale foundation models, present the oracle problem in especially sharp form. The challenge is not limited to any one model architecture; it is inherent, arising from the nature of learned behavior itself.

The held-out test set is a statistical oracle at system scope: it estimates generalization beyond the training data. But it only estimates generalization to the test distribution. A model that performs well on a curated benchmark can fail on data from a different sensor, geography, or season. This gap (between “passes the test set” and “works in deployment”) is structurally similar to “passes in simulation” versus “works on hardware”: an environment mismatch between where the test runs and where the system operates.

Benchmark saturation compounds this. Widely used benchmarks saturate as state-of-the-art models cluster near the ceiling of the metric, which reduces their discriminative power and drives a cycle of replacing benchmarks with harder ones. The benchmarks remain useful for baseline comparison, but high scores on them say progressively less about deployment readiness. A related concern is data contamination: models trained on internet-scale data may have encountered benchmark items during training, inflating scores without a corresponding improvement in capability.

Generative models intensify the oracle problem further. When the system’s output is natural language, code, or other open-ended artifacts, there is often no single correct answer. The same prompt can have many valid responses differing in tone, structure, or emphasis. For some inputs exact oracles still apply (e.g., factual questions with unambiguous answers), but for most generative tasks the field has converged on several alternatives:

  • Rubric-based evaluation. Define natural-language criteria (e.g., factual accuracy, coherence, relevance) and score outputs against them, often using another model as an automated judge. This is effective at scale but inherits the judge model’s own biases (e.g., positional preference, sensitivity to phrasing), and the rubric itself encodes assumptions about what “good” means.
  • Metamorphic relations for language. Paraphrasing a question should not change the factual content of the answer; translating and back-translating should preserve meaning. These are property oracles adapted to generative outputs, an application of Chen et al.’s (1998) framework to a domain far removed from its origins (see the sketch after this list).
  • Red-teaming and adversarial probing. Systematically searching for inputs that elicit harmful, incorrect, or policy-violating outputs, an exploration-intent technique with a human or rubric-based oracle.
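
The sketch referenced above: a minimal metamorphic check for a generative model. The call_model stub and extract_answer helper are hypothetical stand-ins for whatever model API and answer-normalization logic are actually in use, and exact string comparison is a deliberately crude proxy for "same factual content."

```python
def call_model(prompt: str) -> str:
    # Hypothetical stub so the sketch runs; replace with a real model call.
    p = prompt.lower()
    return "Paris." if "capital" in p and "france" in p else "I am not sure."

def extract_answer(response: str) -> str:
    return response.strip().strip(".").lower()

def paraphrase_invariant(question: str, paraphrase: str) -> bool:
    # Property oracle: the factual content of the answer should not depend
    # on phrasing. A rubric or judge model could soften the equality check.
    return extract_answer(call_model(question)) == extract_answer(call_model(paraphrase))

assert paraphrase_invariant("What is the capital of France?",
                            "Which city is the capital of France?")
```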

The common thread. Whether the system under test is a classical ML pipeline, a reinforcement learning policy, or a large language model, the fundamental challenge is the same: the oracle must judge behavior, not just structure, and the behaviors that matter most in deployment are often the hardest to specify in advance. The dimensional framework applies uniformly: the question is always what scope, intent, environment, and oracle are appropriate, even as the specific oracle strategies differ.

The test space

The dimensions described in the preceding sections form a space. A concrete test is a point in it; a test strategy is a distribution of effort across it:

$$\mathcal{T} = \text{Scope} \times \text{Intent} \times \text{Environment} \times \text{Oracle}$$

The space is sparse: many cells are impractical, and the dimensions are not fully independent (scope constrains environment; regulatory acceptance constrains oracle). Conventional terms name regions of this space by fixing some dimensions and leaving others as wildcards:

| Term | Scope | Intent | Env. | Oracle |
|------|-------|--------|------|--------|
| Unit test | Component | | In-process | Exact / Property |
| Integration test | Subsystem | | In-process / Sim. | |
| System test | System | | Sim. / HIL / Field | |
| Regression test | | Regression | | |
| Acceptance test | | Acceptance | | |
| Smoke test | System | Regression | | Exact |

Reading this table. Each blank cell is an unspecified dimension: a decision the term leaves implicit. These mappings describe typical usage, not definitions; they are approximate by design. A “unit test” usually uses an exact oracle (assertEqual), but property-based testing frameworks like QuickCheck and Hypothesis operate at component scope with property oracles. A “regression test” specifies only intent; scope, environment, and oracle are all open. The value of the table is not in fixing rigid definitions but in making visible how much conventional terms leave unspecified.
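
For instance, a component-scoped property test with Hypothesis might look like the following (run_length_encode is an illustrative component, not from the article; run with pytest):

```python
from hypothesis import given, strategies as st

def run_length_encode(s: str) -> list[tuple[str, int]]:
    # Illustrative component under test.
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out

@given(st.text())
def test_rle_round_trips(s):
    # Property oracle over generated inputs: decoding the encoding
    # recovers the input, for any string, not one hand-picked example.
    assert "".join(ch * n for ch, n in run_length_encode(s)) == s
```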

Recurring challenges

The dimensional view makes certain patterns visible.

The structural-behavioral gap. The gap between structural and behavioral correctness, introduced in the oracle discussion above, is one of the most common blind spots in test strategies. A test suite can achieve high coverage while checking only structural properties (e.g., correct types, interfaces, control flow) without checking whether the system’s outputs are substantively correct. This is an oracle problem: the tests use exact oracles on structural properties when the critical question requires property, statistical, or human oracles on behavioral properties.

This pattern is not unique to any particular workflow. Wherever tests are written without intimate knowledge of the system’s intended behavior — its domain invariants, its failure modes, its edge cases — the tests tend toward structural checks, because structural properties are observable from the code alone. Behavioral oracles require understanding what the system should do, not just what it does do. This applies equally to a developer unfamiliar with the domain, a team working under time pressure, or an AI coding agent generating tests without system-level context. The issue is not who writes the test but whether the test writer (human or otherwise) has access to the domain knowledge needed to specify meaningful behavioral oracles. A passing CI pipeline built on structural tests provides less assurance than it appears to.

Testing a moving target. In R&D, features often have not stabilized. Tests written against an immature interface break not because the system regressed but because the specification moved. The effort shifts from catching bugs to maintaining tests. Regression tests assume a stable specification; when the specification is in flux, exploration and progression are more appropriate. This is a specific case of premature intervention: committing resources to a structure that has not yet settled.

Concentrating effort where testing is convenient. Test effort naturally clusters where tooling supports it: in-process, component-scoped, exact oracles. But production failures often involve interactions between components, behavior under degraded conditions, or inputs that were never anticipated. The gap is not in coverage of the code but in coverage of the phenomena.

The limits of sampling

Outside of exhaustive enumeration, testing is sampling. Every test evaluates the system on a finite set of inputs and declares the result representative. For sufficiently small finite input spaces (e.g., a Boolean function of a few variables, or all entries in a lookup table), exhaustive enumeration is possible: the sample covers the entire space, and the test result is definitive for that space. Model checking achieves this for finite-state systems. But for most real systems, the input space is far too large to enumerate, and any test suite is necessarily a sample. How representative that sample is depends on the relationship between the test distribution and the deployment distribution — and on the nature of the risk.

Quantifiable risk. When failure modes are known and their probabilities estimable, statistical testing is effective. Run $n$ tests, observe $k$ failures, bound the failure rate. This is where statistical oracles and confidence intervals are meaningful.
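
A minimal sketch of the zero-failure case (the function is illustrative): with $n$ failure-free runs, the largest failure probability consistent with that outcome at confidence $C$ follows from solving $(1-p)^n = 1 - C$.

```python
def failure_rate_upper_bound(n_tests: int, confidence: float = 0.95) -> float:
    # Zero observed failures in n_tests independent runs: solve
    # (1 - p)^n = 1 - confidence for p.
    return 1.0 - (1.0 - confidence) ** (1.0 / n_tests)

# 3000 failure-free runs bound the failure rate at roughly 1e-3 with 95%
# confidence (the familiar "rule of three": about 3/n).
print(failure_rate_upper_bound(3000))  # ~0.000998
```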

Statistical uncertainty. When data exists but the distribution is not fully known, bounds are weaker. A test suite may cover the observed distribution well while missing regions that have not been encountered. This is where most ML systems on real data live: a model tested on a curated dataset may perform well there and poorly on data from a different distribution.

Knightian uncertainty. When the failure modes themselves are unknown, no amount of testing provides guarantees. A system deployed in an uncontrolled environment faces inputs outside any test distribution. ISO 21448 (SOTIF) frames the goal as shrinking the “unknown unsafe” region, acknowledging that it cannot be eliminated.

These regimes require different responses. In the quantifiable regime, the question is sample size. In the statistical regime, the question is distribution coverage. In the Knightian regime, the question shifts from “does the system work correctly?” to “does the system fail safely?”, and the response moves from testing toward runtime monitoring, graceful degradation, and safety architecture. Butler and Finelli (1993) showed that for ultra-high reliability targets ($10^{-9}$ failure rate), the required test volume is infeasible, a concrete illustration that statistical evidence from testing has fundamental limits.
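
As a back-of-envelope illustration (this arithmetic is not a figure from the paper, but follows from the zero-failure bound sketched above): demonstrating a $10^{-9}$ per-hour failure rate at 99% confidence, with no failures observed, would require on the order of

$$n \ge \frac{\ln(1 - 0.99)}{\ln(1 - 10^{-9})} \approx 4.6 \times 10^{9} \text{ failure-free hours} \approx 5 \times 10^{5} \text{ years}$$

of testing.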

The spectrum of rigor

Given these limits, testing practices range widely, and most of this range is legitimate.

At one end: someone runs the system, tries some inputs, watches the output. Informal, unrepeatable, but in early development it may be all that makes sense. When interfaces are unstable and requirements unclear, investing in automated test infrastructure can be premature.

In the middle: automated test suites, continuous integration, regression pipelines. Most software teams operate here. The dimensional decomposition is most useful at this level, as a tool for auditing where effort is allocated and where gaps exist. Mutation testing (introducing small faults into the source code to check whether the test suite detects them) provides a complementary lens: rather than asking “what did we test?”, it asks “what faults would our tests miss?”, a direct measure of test suite adequacy that is independent of code coverage.
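
A hand-rolled illustration of the idea (real tools such as mutmut for Python automate mutant generation and execution; the clamp function and its mutant are hypothetical):

```python
def clamp(x, lo, hi):
    return max(lo, min(x, hi))

def clamp_mutant(x, lo, hi):
    # Mutant: the upper bound is dropped; it survives any suite that
    # never exercises that boundary.
    return max(lo, x)

def weak_suite(f):
    # Only an interior point: both versions pass.
    return f(5, 0, 10) == 5

def strong_suite(f):
    # Also the boundary: the mutant is "killed".
    return f(5, 0, 10) == 5 and f(99, 0, 10) == 10

assert weak_suite(clamp) and weak_suite(clamp_mutant)          # mutant survives
assert strong_suite(clamp) and not strong_suite(clamp_mutant)  # mutant killed
```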

At the other end: formal verification. Formal methods prove properties about the system’s behavior for all possible inputs using mathematical proof. Dijkstra (1970) observed that program testing can show the presence of bugs, but never their absence; formal verification, in principle, can show absence. In practice, formal methods verify models of the system, and the gap between model and reality is its own source of failure. Formal verification is established in hardware design, protocol verification, and safety-critical software (e.g., avionics, automotive, nuclear). For most teams, full formal verification is beyond reach, but its existence clarifies what testing is and is not: outside of small finite spaces and formal proofs, testing is sampling, and sampling has limits.

The right rigor depends on what is at stake. A research prototype justifies ad hoc testing. A consumer product justifies automated regression suites. A flight control system justifies formal verification of critical properties. Many teams operate under constraints where rigorous testing is a luxury. The question for those teams is not “are we testing enough?” but “given what we can afford, are we testing the right things?”

Physically-facing systems

Systems that interface with the physical world — through direct actuation (e.g., robotics, autonomous vehicles, drones, industrial automation) or through real-world perception (e.g., anomaly detection on sensor data, computer vision on medical or satellite imagery, environmental monitoring) — are where all of these problems compound, because each dimension has richer structure.

The simulation fidelity ladder (MIL → SIL → PIL → HIL → field) is not only a cost tradeoff; each rung has a different fidelity envelope. The sim-to-real gap is a fundamental epistemic boundary: simulation tests can only observe phenomena that the simulator models. The oracle problem is harder: there is generally no exact expected output for a perception pipeline, and the conditions under which correctness matters most (e.g., sensor noise, weather, novel objects) are the hardest to specify oracles for. Non-determinism is intrinsic: sensor noise and actuator variance mean that running the same test twice may not produce the same result. Scope boundaries are porous: isolating a component requires simulating its physical context, and the simulation is itself a system with its own fidelity assumptions.

Sensor log replay (recording real sensor data and replaying it through the software pipeline — e.g., rosbag replay in ROS, or equivalent tooling in other frameworks) is an effective regression technique because it makes good choices along each dimension: system scope, regression intent, real-data environment, differential oracle. Compare current pipeline output to previous output on the same recording to catch unintended changes. The technique works well, but designing it systematically — choosing which recordings, which oracle, what constitutes a meaningful difference — remains largely ad hoc.
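
A minimal sketch of the pattern (the frame format, the perceive stand-in, and the tolerance are hypothetical; real frames would come from recorded logs):

```python
def perceive(frame):
    # Stand-in for the perception pipeline under test.
    return {"objects": len(frame["points"]) // 100,
            "confidence": min(1.0, len(frame["points"]) / 1000)}

def replay_diff(frames, reference_outputs, conf_tol=0.05):
    """Return the frames where current output diverges from the reference."""
    diffs = []
    for i, (frame, ref) in enumerate(zip(frames, reference_outputs)):
        out = perceive(frame)
        if out["objects"] != ref["objects"] or \
                abs(out["confidence"] - ref["confidence"]) > conf_tol:
            diffs.append((i, ref, out))
    # The differential oracle only reports differences; deciding which
    # ones are meaningful is a separate, often human, judgment.
    return diffs
```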

Learned policies (trained via reinforcement learning, imitation learning, or other methods) compound the problem further. A policy may be tested in simulation during training, on hardware in a controlled setting, or both, then evaluated on held-out scenarios and deployed — each stage occupying a different point in the test space. The critical question is behavioral: does the policy do the right thing under sensor degradation, novel objects, and conditions it has never encountered? This is the oracle problem, the environment problem, and the uncertainty problem, all at once.

Practical guidance

The point of decomposing tests along multiple dimensions is not to add process but to make visible what is already being decided.

Enumerate the space. For each component, subsystem, and system: what intents apply? What environments are available? What oracles are feasible? The grid reveals gaps.

Match oracle to environment. An exact oracle in simulation is only as good as the simulation’s fidelity. A property oracle (e.g., “the robot does not collide with a known obstacle”) can be valid across environments because it constrains behavior rather than predicting output.

Defer regression investment until interfaces stabilize. When the specification is in flux, exploration and progression are more valuable. The time to invest in regression infrastructure is when the interfaces it guards are stable enough that breaking them is informative.

Test behavior, not just structure. Structural tests (e.g., correct types, interfaces, return codes) are valuable; they catch real bugs and are cheap to write. But they do not substitute for behavioral coverage. The key is whether the test writer has access to the domain knowledge needed to specify meaningful behavioral oracles. This is a question of inputs to the test design process, not of who or what performs the writing.
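
The same hypothetical function tested both ways (classify_reading and its threshold are illustrative):

```python
def classify_reading(temp_c: float) -> str:
    return "alarm" if temp_c > 80.0 else "normal"

def test_structural():
    # Structural: right type, right interface. Passes even if the
    # threshold is wrong.
    assert isinstance(classify_reading(25.0), str)

def test_behavioral():
    # Behavioral: encodes domain knowledge about where the boundary is
    # and what must happen on either side of it.
    assert classify_reading(79.9) == "normal"
    assert classify_reading(80.1) == "alarm"
```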

Design for testability. Controllability (ease of placing the system in a known state) and observability (ease of inspecting internal state) must be designed in. The PIE model (Propagation, Infection, Execution; Voas, 1992) formalizes this: a fault must be executed, cause infection of intermediate state, and propagate to an observable output. Code coverage measures execution but does not directly measure infection or propagation.

Accept irreducible costs. Some tests must run on hardware. Some oracles must be human. Weyuker (1986) proposed axioms for test adequacy and showed that most common criteria fail to satisfy them all. A principled strategy identifies where costs are unavoidable and allocates resources accordingly.

Closing remarks

This decomposition is a lens, not a ground truth. The dimensions proposed here are not exhaustive, not perfectly independent, and not the only useful decomposition. The deeper point is that testing can be a principled practice. The conventional vocabulary is useful shorthand, but it leaves most decisions implicit. Making them explicit, even approximately, is what allows a team to reason about whether its testing investment matches its actual risk, given its actual constraints. That reasoning is available to any team. The investment is less in tooling than in clarity about what the tests are for.

References

Works cited

Barr, E.T., Harman, M., McMinn, P., Shahbaz, M. and Yoo, S. (2015). “The Oracle Problem in Software Testing: A Survey.” IEEE Transactions on Software Engineering, 41(5), pp. 507–525.

Beck, K. (2002). Test-Driven Development: By Example. Addison-Wesley.

Butler, R.W. and Finelli, G.B. (1993). “The Infeasibility of Quantifying the Reliability of Life-Critical Real-Time Software.” IEEE Transactions on Software Engineering, 19(1), pp. 3–12.

Chen, T.Y., Cheung, S.C. and Yiu, S.M. (1998). “Metamorphic Testing: A New Approach for Generating Next Test Cases.” Technical Report HKUST-CS98-01, Department of Computer Science, Hong Kong University of Science and Technology.

Cohn, M. (2009). Succeeding with Agile: Software Development Using Scrum, Addison-Wesley. See also: “The Forgotten Layer of the Test Automation Pyramid.” Mountain Goat Software.

Dijkstra, E.W. (1970). “Notes on Structured Programming.” EWD249, Technological University Eindhoven. Reprinted in Dahl, O.-J., Dijkstra, E.W. and Hoare, C.A.R. (1972), Structured Programming, Academic Press.

Fowler, M. (2012). “TestPyramid.” martinfowler.com.

Meszaros, G. (2007). xUnit Test Patterns: Refactoring Test Code. Addison-Wesley.

Myers, G.J. (1979). The Art of Software Testing. John Wiley & Sons.

Rothermel, G. and Harrold, M.J. (1997). “A Safe, Efficient Regression Test Selection Technique.” ACM Transactions on Software Engineering and Methodology, 6(2), pp. 173–210.

Voas, J.M. (1992). “PIE: A Dynamic Failure-Based Technique.” IEEE Transactions on Software Engineering, 18(8), pp. 717–727.

Weyuker, E.J. (1986). “Axiomatizing Software Test Data Adequacy.” IEEE Transactions on Software Engineering, SE-12(12), pp. 1128–1138.

Standards cited

DO-178C. Software Considerations in Airborne Systems and Equipment Certification. RTCA, 2011.

IEC 61508. Functional Safety of Electrical/Electronic/Programmable Electronic Safety-Related Systems. International Electrotechnical Commission.

IEEE 1012. IEEE Standard for System, Software, and Hardware Verification and Validation. IEEE, 2016.

ISO 21448:2022. Road Vehicles — Safety of the Intended Functionality. International Organization for Standardization.

ISO 26262. Road Vehicles — Functional Safety. International Organization for Standardization.

ISO/IEC/IEEE 29119. Software and Systems Engineering — Software Testing. International Organization for Standardization.