

3. Methodology

3.1 Basic principles

3.2 Structure of the methodological framework

3.3 Comparison with other methodologies

3.4 Indirect safety assurance, through the development process

3.5 Specific safety assurance


3.1 Basic principles

A few principles must be kept in mind by organisations in charge of the system development and safety validation:

1.    safety must be ensured even when some clearly needed safety requirements are not stated explicitly in the initial list of requirements. Any need for new safety requirements must be reported, so that they can be added when necessary.

2.    no requirement should demand that a system be 100% safe (all complex systems have some probability, however small, of entering an unexpected state).

3.    safety requirements (including those added) should be verifiable, and should be traced throughout the development.

4.    safety assurance should be an integral part of the development life cycle from the start.

5.    similarity with previous operational systems that are already "safety-validated" is a very important factor in validating the new system; outputs from the safety validation of these previous systems must be reused wherever possible, so that the experience gained is not lost. Reuse is only possible where the context of use is comparable, as differences in context often rule it out.

6.    all safety-related activities should be formally recorded, so that they can be justified.

3.2 Structure of the methodological framework

This framework is composed of two complementary parts:

·      indirect safety assurance, dealing with the system development process: methods are not specifically dedicated to safety, but improve safety when used;

·      specific safety assurance, dealing with methods specifically addressing safety (for the automated system).

 

The main reasons for separating them are:

·      to avoid mixing up the safety-specific part of the methodology, which is used only because the system is safety significant, with "normal" methods, which should be used even for systems which are not safety significant;

·      to single out the activities that a safety manager should manage more specifically.

 

Nevertheless, the two parts are complementary and must be used together to ensure safety. Note also that the Safety Plan described in the second part affects activities described in the first. The separation is therefore a matter of convenience: all activities linked to system development should be integrated in a single system engineering process that takes safety issues into account. In both parts (3.4 and 3.5), the methodological framework is presented in chronological order.

3.3 Comparison with other methodologies

 

This general approach to safety assurance is similar to the one recommended by the ARP4754 document for aircraft development, which states that the system development process and the safety assessment process should interact, and that both should be used in the aircraft certification process.

 

 

This approach is also consistent with the one used by the EUROCONTROL Safety Assessment Methodology, as presented in the following figure:

[Figure: Coverage of the life cycle in the EUROCONTROL methodology]

 

However, to avoid duplicating work already done or planned, ARIBA focused on aspects felt to be important but addressed little or not at all in the EUROCONTROL document:

·      unlike the EUROCONTROL methodology, the present document includes a part on "indirect safety assurance" (i.e. methods to be followed throughout the development cycle which, although not specific to safety, are felt to have a strong impact on the level of safety);

·      ARIBA keeps a specific focus on practical issues (i.e. recommendations on how to carry out the work in practice, e.g. for COTS);

·      ARIBA also emphasises the need, felt to be very important, for international standardisation, specific to the ATM domain, of the way safety requirements and rules for ensuring safety are expressed and assessed (the EUROCONTROL document has begun work on this issue, e.g. with its risk classification scheme).

The main differences between the EUROCONTROL methodology and the ARIBA methodology are identified in an appendix at the end of the present document.

 

3.4 Indirect safety assurance, through the development process

3.4.1 Rationale

ATM systems are complex, software-based systems. In such systems, although hardware failures can occur, most events with a negative impact on safety are caused by software bugs or, more generally, by insufficient quality in the specification, design or implementation of the system.

 

Therefore, all methods used to improve and assure this quality have a favourable impact on safety, either directly or indirectly (through improved dependability), and are a key element of a safety case. This should not be underestimated; it is recognised by guideline documents such as ED-12B/DO-178B, used in aircraft development, many of whose requirements address this kind of issue, for example through the emphasis placed on test coverage and partitioning.

 

Experience shows that, for ATM, ED-12B/DO-178B can be used as a set of guidelines, with some adaptations, but that applying it strictly as a standard to ATM without any adaptation is impossible, because of specific ATM characteristics:

·      very large systems

·      very complex systems; this complexity sometimes makes partitioning difficult

·      wide use of COTS products, both during development and in operations (including compilers, operating systems, COTS libraries, etc.)

·      wide use of previously developed components, adapted or not for use in the new system (in practice, new developments generally concern only part of the system, the other parts remaining largely unchanged).

 

No specific life cycle model is required to develop these systems (the framework is easily adapted to different life cycles). However, given the main characteristics of these systems, it is recommended to specify and apply reference processes covering all activities, in the spirit of quality assurance standards such as ISO 9001, and to take safety aspects into account when specifying these processes. The following sections describe practical methods, adapted to the characteristics of ATM systems, that are recommended when applying these processes. The reader should consider them together with the methods described in section 3.5 (Specific safety assurance) to get a more complete view of what is proposed for the safety assurance and validation of automated systems by manufacturers.

 

Adaptation according to safety criticality

 

Not all parts of an ATM system have the same safety criticality. Some of them are only very remotely related to safety (for example, long-term assessment of demand to forecast traffic levels far in advance). The following description assumes that the system of concern has a direct impact on safety, as ATC automated systems do. When safety criticality is lower, the development assurance level may also be lower. Nonetheless, the following recommendations should be considered to apply to all parts of an advanced ATM system, unless explicitly stated otherwise.

3.4.2 Case of newly developed parts

3.4.2.1 Along the whole life cycle

-     Documentation management plan: a documentation management plan should specify the documents needed and their contents, and provide templates. This is required to make information easily accessible to everybody who needs it (whether or not the needed information is available is an important quality factor). DOD-STD-2167A, or a similar standard, is recommended as a reference.

-     Configuration management: configuration management techniques are necessary to control the complexity of the system and of its successive changes (an uncontrolled system is often unsafe).

-     Requirement traceability: a tool managing requirements traceability should be used throughout the life cycle (the number of requirements usually makes manual management very tedious or even nearly impossible). Requirement traceability, including of course the traceability of dependability and safety requirements, is essential to safety assurance.

-     Tools: All tools used should be either suitably certified or proven in use. This does not apply to parts of the system that are not safety-critical.
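As an illustration of one check such a tool automates, the following Python sketch flags safety requirements that are not yet traced to at least one design element and at least one test. The `Requirement` class and its fields are hypothetical, chosen for the example; an actual traceability tool stores these links in its own database and provides its own queries.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Requirement:
    """A requirement and the artefacts it is traced to (hypothetical model)."""
    req_id: str
    safety_related: bool
    design_refs: list[str] = field(default_factory=list)  # design elements
    test_refs: list[str] = field(default_factory=list)    # verifying tests


def untraced_safety_requirements(reqs: list[Requirement]) -> list[str]:
    """Return the IDs of safety requirements lacking a design element or a test."""
    return [
        r.req_id
        for r in reqs
        if r.safety_related and (not r.design_refs or not r.test_refs)
    ]


if __name__ == "__main__":
    reqs = [
        Requirement("SR-1", True, ["DES-4"], ["TST-10"]),
        Requirement("SR-2", True, ["DES-7"], []),  # safety requirement, no test yet
        Requirement("FR-3", False),                 # not safety-related
    ]
    print(untraced_safety_requirements(reqs))  # ['SR-2']
```

Run regularly (for example at each baseline), such a check makes a lost safety requirement visible early instead of at final verification.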

 

3.4.2.2 Operational use definition

Note that this work has generally been done when specifying requirements. In this case, it can be skipped, or simply checked.

It consists of defining how operational users will use the system. This use must be consistent with applicable operational procedures.

It is now recognised that this phase is particularly tricky, and that many problems in later phases originate in a lack of objective data collected at this stage.

Work analysis techniques are recommended to complement the expression of operational needs with objective data, which may or may not support some of the requests. These techniques can also help to identify the tasks to be performed, the data used by operators, and the constraints on the work activity.

When defining the operational concept, work analysis techniques are recommended during simulation experiments that test operational concepts, to explain how operators work with the proposed concept. The questions to be answered are:

·      What performance can be ensured using this solution?

·      What does it change for the operator in terms of mental demands, required skills, and risk of error?

 

The recommended work analysis techniques depend on the objective or on data already available:

·      Before writing down requirements related to operational use, knowledge-elicitation techniques (interviews with operators and more sophisticated techniques...) are recommended when there is a need to fully understand how operators work and how they take decisions. Such information is needed for validating their needs.

·      Once requirements have been written down, they should be reviewed. This is useful for checking their completeness, consistency, ambiguities, testability, etc. The favoured technique is the "phased inspections" technique (see references for details). Phased inspections have the objective of guaranteeing (almost) 100% problem detection while saving much time and money. A very high problem detection rate is achieved by:

¨    performing several iterations with different goals and different people (chosen according to goals),

¨    using check lists and computer tools designed for assisting inspectors in their work,

¨    and verifying the rigour of inspectors' work (statistics about the use of the assistance tools, questionnaires they should be able to answer...).

·      For HMI assessment, the following methods are recommended:

¨    As a first step, when it is felt that there is a high risk that the HMI will not be fit for purpose, HMI modelling is recommended to describe the HMI formally before implementing it, and to assess it from this description.

¨    Then, fast prototyping is recommended to implement HMI requirements (at least those which involve most risk), in order to validate them with operators. A prototype is recommended whenever a new HMI is developed, in order to:

Þ  make requirements "visible", so that they are more easily assessed

Þ  confront involved parties with the consequences of their wishes,

Þ  make sure that ATM developers understand requirements correctly.

¨    Human factors techniques: once a prototype has been built, these techniques, which include specialised work analysis methods such as electrocardiograms or eye-tracking, may be used as a complement, when equipment and experienced staff are available, to assist in the validation of HMI usability requirements.

 

3.4.2.3 "Bid" phase

As a call for tender is usually issued for developing the new system, manufacturers have to prepare a bid, and must take safety into account during this preparation in the following ways:

·      analysis of safety requirements, and of the impact of other requirements on safety (see section 3.5.2); it is important to detect any safety-related problem in the call for tender during this phase, so as to be aware of possible changes that would be required;

·      analysis of the methods for ensuring safety required by the call for tender, when these are not methods usually used by the manufacturer, and of their possible impact (on cost, duration, etc.);

·      and description of the proposed safety case.

 

3.4.2.4 System specification

This is the phase that translates operational requirements used as inputs, including dependability and safety requirements, into system requirements.

The main problem is to ensure that, during this translation, safety-related requirements are considered and none of them gets lost. Besides normal requirement traceability, it is useful to apply the following method:

 

-     Definition of standard system specification rules (especially rules intended to ensure dependability). These rules should be referred to in the safety management plan (see section 3.5.4).

 

Also refer to section 3.5.5 for safety-management aspects.

 

3.4.2.5 Design

Methods recommended are those useful for improving the design (for all aspects of the design, and especially safety-related aspects):

 

·      Definition of standard design rules (especially rules intended to ensure dependability); as a simple example, such a rule can be "always monitor periodic input, and raise an alarm when input is missing". These rules should be referred to in the safety management plan (see section 3.5.4).

·      Prototyping

¨    before committing to the choice of a new technology during the design phase, it is important to ensure that this technology is fully understood, assessed, and mastered by ATM developers. There is always a risk in adopting new technologies, new methods, new tools, new approaches, etc. (either really new or new in the ATM domain). Exploratory prototyping is recommended in this case, as it is usually the best way to achieve these objectives.

¨    when a difficult design choice must be made, or when one of several alternatives must be chosen, exploratory prototyping is also recommended.

·      Dependability-related techniques: design choices must be validated with respect to dependability. All techniques aimed at assessing dependability may be used here, but at first only at a high level: the input data required are generally missing, or too unreliable to make a low-level study worthwhile. Only when some experience is available, providing good feedback and reliable data, is it recommended to apply these techniques at a lower level, for new systems with the same kind of design;

·      when requirements imply the development of new algorithms, new protocols, etc., automated theorem “provers” may help to prove their correctness (thus mitigating one of the risk factors).

·      Performance modelling and simulation: in simple cases, it is recommended to validate design choices against performance requirements by modelling and simulating the system with specialised tools. Results should then be refined throughout the life cycle as more precise data become available, especially when developing successive releases of the same system (or of the same family of systems).

·      Reviews: again, reviews are strongly recommended for design documents: studies show that the design phase is a major source of errors. The "phased inspections" technique is favoured (see above).
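The example design rule quoted above ("always monitor periodic input, and raise an alarm when input is missing") can be sketched as a small watchdog in Python. This is a minimal sketch under assumptions the text does not state: the class name `PeriodicInputMonitor`, the tolerance factor applied to the nominal period, the once-per-outage alarm policy and the injectable clock (used so the example runs without waiting) are all illustrative choices.

```python
from __future__ import annotations

import time
from typing import Callable


class PeriodicInputMonitor:
    """Raise an alarm when a periodic input is overdue (design rule sketch)."""

    def __init__(self, period_s: float, tolerance: float,
                 raise_alarm: Callable[[str], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.deadline_s = period_s * tolerance  # allowed gap between inputs
        self.raise_alarm = raise_alarm
        self.clock = clock
        self.last_seen = clock()  # the first input gets one deadline to arrive
        self.alarm_active = False

    def on_input(self) -> None:
        """Call each time the periodic input (e.g. a surveillance update) arrives."""
        self.last_seen = self.clock()
        self.alarm_active = False

    def check(self) -> None:
        """Call regularly (e.g. from a timer); alarms once per missing-input episode."""
        overdue = self.clock() - self.last_seen > self.deadline_s
        if overdue and not self.alarm_active:
            self.alarm_active = True
            self.raise_alarm("periodic input missing")


if __name__ == "__main__":
    now = [0.0]                      # simulated clock for the example
    alarms: list[str] = []
    monitor = PeriodicInputMonitor(4.0, 1.5, alarms.append, clock=lambda: now[0])
    now[0] = 4.0
    monitor.on_input()
    monitor.check()                  # input on time: no alarm
    now[0] = 11.0
    monitor.check()                  # 7 s since last input > 6 s: alarm
    now[0] = 12.0
    monitor.check()                  # still missing: alarm not repeated
    print(alarms)                    # ['periodic input missing']
```

Raising the alarm only once per outage avoids flooding operators while the input stays missing; whether that is acceptable is itself a design decision to be covered by the design rules.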

 

3.4.2.6 Software coding

·      Programming rules are required and must have the following objectives:

¨    ensuring robustness; for example, the policy to be used for dealing with exceptions should be defined. The robustness property is essential to dependability assurance, as not all situations are foreseeable.

¨    ensuring correctness and readability; for example, using different variables with about the same meaning and almost the same name may lead to errors that are difficult to detect; programming rules should prevent this kind of error.

·      Testing: testing is a requirement, and is the traditional way to remove dependability risks linked to programming errors; it covers many techniques which aim to discover errors in programs (i.e. non-conformance to specification). These techniques are well known and do not require further development in this report; the following points are part of the methodological framework:

¨    It must be kept in mind that a 100% test coverage is impractical in such complex systems.

¨    Nonetheless, realistic (not excessively high) test coverage objectives have to be defined and checked. These objectives should be set according to the safety impact of the component (higher for components with a strong impact on safety, in accordance with section 3.5.5 below).

¨    Tests should be reusable (for use as regression tests); test programs can be used for that. Commercial tools are available and are recommended here for producing reusable HMI tests. Generally speaking, it is recommended to automate testing to the maximum extent.

¨    Failure modes must be tested.

¨    Orthogonal testing (i.e. testing that unwanted things do not happen) should not be forgotten.

¨    Another technique, such as the reviews described below, must be used to make up for the incomplete test coverage.

·      Reviews: reviews are a good complement to testing, as there is no better method for detecting errors missed by testing. The "phased inspections" technique, aiming at 100% defect detection, is favoured (see above); it is very appropriate here, as software programming is its primary field of application. It is recommended to use automated tools to the maximum extent to assist this work, and to develop them if needed. Complete coverage of the source code, especially of all components likely to impact on safety, is possible and recommended (except for code produced by code generators). This has to be done as part of an optimised strategy describing tests and reviews and their relative scopes, mentioned in the safety management plan (see section 3.5.4).
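Coverage objectives set according to the safety impact of each component can be checked automatically against the figures reported by a coverage tool. In the Python sketch below, the three criticality levels, the percentage targets and the component names are hypothetical placeholders; actual values would be set in the safety management plan.

```python
from __future__ import annotations

# Hypothetical coverage objectives (statement coverage, %) per safety
# criticality level; real values would come from the safety management plan.
COVERAGE_TARGETS = {"high": 90.0, "medium": 75.0, "low": 50.0}


def coverage_shortfalls(components: dict[str, tuple[str, float]]) -> dict[str, float]:
    """Map each component below its objective to the missing percentage points."""
    shortfalls = {}
    for name, (criticality, measured) in components.items():
        target = COVERAGE_TARGETS[criticality]
        if measured < target:
            shortfalls[name] = round(target - measured, 1)
    return shortfalls


if __name__ == "__main__":
    measured = {
        "conflict_alert": ("high", 86.0),
        "flight_plan_editor": ("medium", 80.0),
        "traffic_statistics": ("low", 40.0),
    }
    print(coverage_shortfalls(measured))
    # {'conflict_alert': 4.0, 'traffic_statistics': 10.0}
```

A reported shortfall does not necessarily mean more tests: the optimised test and review strategy may instead cover the gap with reviews.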

 

3.4.2.7 Integration

·      Reviews are recommended: for integration, the main benefit of reviews (especially "phased inspections") is to verify that integrating the new components will cause no error. Interfaces between the current system and the new component must be checked with special care. Such inspections must be automated to the maximum extent (through in-house or standard tools; as a simple example, many checks on the interfaces between C programs can be automated with the lint tool).

·      Integration testing is required; this may use all techniques aiming at discovering errors which can appear only after integration, especially in interfacing, and in functions which use both new components and other parts of the system. Recommendations about testing given above also apply.

·      Regression testing: this is a check that changes to the existing system have not introduced errors in previously tested parts. Its application is very simple if all tests already run have been carefully recorded. It is recommended that their use should be automated and systematic.
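When recorded tests are kept in a machine-readable form, re-running them is easily automated. The Python sketch below assumes test cases stored as JSON (name, arguments, expected result); the separation-check functions and their values are hypothetical, and the second version shows how a small change to a comparison is caught by the recorded "at minimum" case.

```python
from __future__ import annotations

import json
from typing import Any, Callable

# Test cases recorded when the function was first validated (hypothetical
# function and values; in practice they would be stored in files under
# configuration management).
RECORDED_CASES = """[
  {"name": "above minimum", "args": [6.0, 5.0], "expected": true},
  {"name": "at minimum",    "args": [5.0, 5.0], "expected": true},
  {"name": "below minimum", "args": [4.9, 5.0], "expected": false}
]"""


def run_regression(recorded_json: str, func: Callable[..., Any]) -> list[str]:
    """Re-run every recorded case against func and describe each mismatch."""
    failures = []
    for case in json.loads(recorded_json):
        actual = func(*case["args"])
        if actual != case["expected"]:
            failures.append(
                f"{case['name']}: expected {case['expected']!r}, got {actual!r}")
    return failures


def separation_ok_v1(distance_nm: float, minimum_nm: float) -> bool:
    return distance_nm >= minimum_nm


def separation_ok_v2(distance_nm: float, minimum_nm: float) -> bool:
    return distance_nm > minimum_nm  # a change that silently alters behaviour


if __name__ == "__main__":
    print(run_regression(RECORDED_CASES, separation_ok_v1))  # []
    print(run_regression(RECORDED_CASES, separation_ok_v2))
    # ['at minimum: expected True, got False']
```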

3.4.3 Special case: use of COTS software, or of already-developed software

This section addresses COTS (commercial off-the-shelf) software, either included as part of the system (e.g. software libraries) or used to produce the system (e.g. code generators). It mainly applies to software potentially impacting on safety (tools such as text editors are not considered here).

Most of the above recommendations cannot be applied to COTS software. The method recommended in such a case is:

·      assess the safety criticality of the product, according to its intended use

·      then, gather all possible information on:

¨    the way this product has been developed,

¨    and/or statistics on its reliability, from past experience in its use,

¨    and, if an international, inter-domain, certification scheme, with specified safety levels, has been defined (see WP2 report), the certified safety level of this product (if certified), together with the defined meaning of this level.

·      If data gathered provide enough evidence that the product is unsafe for its intended use, discard it.

·      Test this product; this is always useful to get practical experience on its use, and to evaluate it. If data gathered in the previous phase were not sufficient to provide the required trust in the product reliability, test coverage should be as wide as practicable. However, testing for ultra-high reliability requirements is not practicable.

·      One of the techniques used should be fault injection to test the reaction of the COTS product and to test the robustness of the system in case of a COTS failure (see e.g. [Voas, 1999]).

·      If neither available data, nor testing, provide required evidence, reverse engineering tools should be used to justify dependability. In this case, there should be a focus on safety-related characteristics, to limit costs, as this method may be very costly (see: ARIBA WP2 and WP5 final reports).

·      If, after using the above methods, there is still insufficient evidence that using this product in this context would be safe, discard the product and find another solution. Alternatively, when the missing evidence is limited, it may be possible in some cases to write a "wrapper" placed between the COTS product and the rest of the system, providing the missing guarantees through appropriate checks.
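Fault injection around a COTS item can be sketched as follows: calls to the product are intercepted and, at chosen rates, made to fail or return corrupted data, and the reaction of the surrounding system is observed. This Python sketch only illustrates the principle and is not the approach described in [Voas, 1999]; the `AltitudeFilter` component, the fault rates and the fallback behaviour (keep the last valid value and flag degraded operation) are hypothetical.

```python
from __future__ import annotations

import math
import random
from typing import Callable, Optional


def inject_faults(cots_call: Callable[[], float], rng: random.Random,
                  p_exception: float, p_corrupt: float) -> Callable[[], float]:
    """Wrap a COTS call so that it sometimes raises or returns a corrupted value."""
    def faulty_call() -> float:
        r = rng.random()
        if r < p_exception:
            raise RuntimeError("injected COTS failure")
        if r < p_exception + p_corrupt:
            return math.nan              # injected corrupted output
        return cots_call()
    return faulty_call


class AltitudeFilter:
    """Hypothetical system component that must tolerate COTS failures."""

    def __init__(self, source: Callable[[], float]) -> None:
        self.source = source
        self.last_valid: Optional[float] = None
        self.degraded = False

    def update(self) -> Optional[float]:
        try:
            value = self.source()
        except Exception:                # any failure of the COTS call
            value = math.nan
        if math.isnan(value):
            self.degraded = True         # keep last valid value, flag degradation
            return self.last_valid
        self.degraded = False
        self.last_valid = value
        return value


def run_campaign(runs: int, seed: int) -> int:
    """Drive the component with injected faults; return the number of degraded cycles."""
    rng = random.Random(seed)
    component = AltitudeFilter(inject_faults(lambda: 35000.0, rng, 0.05, 0.05))
    degraded_cycles = 0
    for _ in range(runs):
        out = component.update()
        # Robustness property under test: a NaN must never reach the rest of the system.
        if out is not None and math.isnan(out):
            raise AssertionError("corrupted COTS output propagated")
        degraded_cycles += component.degraded
    return degraded_cycles


if __name__ == "__main__":
    print(run_campaign(runs=1000, seed=1))
```

Seeding the random generator makes each campaign reproducible, which matters when a failure found by injection has to be analysed and the fix re-tested.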

 

Note that this work does not have to be done again, if the same product was already used for the same usage in a previous system. Available data may be used in this case, with a check that they are still valid for the new system. For data no longer valid, e.g. because of a different usage environment, the process has to be repeated.

 

It is good practice to encapsulate the COTS item in a software package that only allows wanted effects to propagate into the wider system.
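A minimal Python sketch of such an encapsulating package is shown below, assuming a hypothetical COTS function that computes the distance between two geographic positions. The wrapper rejects inputs outside the documented domain and outputs that cannot be correct, so that a misbehaving COTS item raises a defined error instead of propagating bad data. The contract checks and the stand-in haversine implementation are illustrative assumptions, not the interface of any real library.

```python
from __future__ import annotations

import math
from typing import Callable

MAX_DISTANCE_KM = 20_040.0  # about half the Earth's circumference


class CotsContractError(Exception):
    """Raised when the COTS item is called, or answers, outside its contract."""


DistanceFn = Callable[[float, float, float, float], float]


def wrap_distance(cots_distance_km: DistanceFn) -> DistanceFn:
    """Encapsulate a COTS distance function behind input and output checks."""
    def checked(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
        for lat in (lat1, lat2):
            if not -90.0 <= lat <= 90.0:       # also rejects NaN
                raise CotsContractError(f"latitude outside domain: {lat}")
        for lon in (lon1, lon2):
            if not -180.0 <= lon <= 180.0:
                raise CotsContractError(f"longitude outside domain: {lon}")
        d = cots_distance_km(lat1, lon1, lat2, lon2)
        if not (math.isfinite(d) and 0.0 <= d <= MAX_DISTANCE_KM):
            raise CotsContractError(f"implausible distance from COTS item: {d}")
        return d
    return checked


def stand_in_cots_distance(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Stand-in for the COTS library: haversine distance on a 6371 km sphere."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(a)))


if __name__ == "__main__":
    distance = wrap_distance(stand_in_cots_distance)
    print(round(distance(0.0, 0.0, 0.0, 1.0), 1))   # 111.2
    try:
        distance(95.0, 0.0, 0.0, 0.0)
    except CotsContractError as err:
        print(err)                                    # latitude outside domain: 95.0
```

The rest of the system only ever calls the wrapped function, so the checks cannot be bypassed, and a replacement COTS product can be introduced behind the same interface.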

3.5 Specific safety assurance

3.5.1 Introduction

This section applies to safety-critical parts of the system, i.e. most parts of a complete ATM system, excluding only those parts where the cost of performing the recommended activities is obviously not justified by the safety benefits, such as tools for long-term assessment of traffic demand.

 

The following safety activities are recommended (excluding responsibility issues, which are dealt with in WP6.2):

·      Initial safety assessment

·      Assessment of safety-related activities to be performed for this system (they depend on several criteria.)

·      Planning of safety programme

·      Identification of hazards and specification of mitigation solutions

·      Monitoring and tracking of hazards and safety issues

·      Verification that the system complies with safety requirements

·      Safety-related support during installation, commissioning, overall validation, and transition.

 

This is summarised in the following figure:

[Figure: Specific safety assurance process]

3.5.2 Initial safety assessment

This is an initial activity performed on the input data (especially requirements).

The objective is to check the scope, completeness, consistency, ambiguities, testability, etc. of the stated safety requirements, and the impact on safety of other requirements (especially to identify "unsafe" requirements).

New safety requirements may have to be added during this activity, either for legal reasons, or because of the safety policy of the manufacturer involved, or simply because the stated requirements are found to be insufficient to guarantee a safe system.

It is recommended to use:

·      a standard check list of rules that requirements specification must respect (for example: the allowed domain of input values must be specified, as well as the behaviour of the system in case of a value outside its domain);

·      standard recommendations for hardware selection
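The checklist rule given as an example above lends itself to automated checking when input specifications are held in a structured form. In this Python sketch, the `InputSpec` record and the example inputs are hypothetical; the check only reports missing statements and cannot judge whether a stated domain or behaviour is correct, which remains the reviewers' task.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class InputSpec:
    """How a requirements document specifies one system input (hypothetical model)."""
    name: str
    domain: Optional[str] = None                   # allowed values, e.g. a range
    out_of_domain_behaviour: Optional[str] = None  # required reaction otherwise


def checklist_findings(specs: list[InputSpec]) -> list[str]:
    """Apply the checklist rule: domain and out-of-domain behaviour must be stated."""
    findings = []
    for spec in specs:
        if not spec.domain:
            findings.append(f"{spec.name}: allowed domain of input values not specified")
        if not spec.out_of_domain_behaviour:
            findings.append(f"{spec.name}: behaviour outside the domain not specified")
    return findings


if __name__ == "__main__":
    specs = [
        InputSpec("cleared_flight_level", "FL000 to FL660",
                  "reject the value and alert the controller"),
        InputSpec("reported_altitude", "illustrative range"),
    ]
    for finding in checklist_findings(specs):
        print(finding)
    # reported_altitude: behaviour outside the domain not specified
```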

 

Currently, one of the major problems is the great variety in the expression of safety requirements, and in levels of safety required, without any clear reasons for these differences. Therefore, the generalised use of unique reference standards, adapted to the ATM domain, is recommended for description of safety requirements. In these standards, numerical figures should not be the primary references, as assessment of numerical figures is often disputable.

More generally, this phase should be supported by standard specification rules related to safety, together with standard checklists to assist in the application of these rules.

 

The favoured method is reviews, and more specifically the "phased inspections" technique, already addressed above.

3.5.3 Assessment of safety activities to be performed

This consists of tailoring the recommended framework to actual needs and requirements, when needed.

This is needed in the cases below.

·      The system (or more probably the considered component) does not have a significant impact on safety.

·      The system is very similar to one for which safety activities have already been done; in this case, already available results may and should be reused, wherever possible.

·      Stated requirements include the requirement that some other methodology should be used for this development, instead of, or in complement to, the recommended methodological framework. Parts of the framework impacted by this requirement should be adapted accordingly.

 

No technique is specifically recommended, but the principles of inspections, such as the use of checklists, should be applied.

3.5.4 Planning of safety programme

This is the production of a Safety Plan, specifying all required activities related to safety. It should be based on a standard content, adapted to the system of concern according to the safety requirements and the above assessment, and agreed by the future user of the system.

The Safety Plan should include:

·      references to other standards applicable (without duplicating them);

·      safety-related activities to be performed (by all organisations participating in the development), when they must be performed, and which methods must be used;

·      references to documents describing these methods;

·      roles, competences and organisational interfaces;

·      deliverables from the safety programme;

·      how and when information on safety-related activities should be recorded.

 

It should also describe required actions, and allowed or required adaptations to the methodology, in non-nominal situations. A typical case is the occurrence of large delays or budget problems during development. If these difficult situations have not been explicitly considered, they might have a negative impact on safety, in practice (although they should not, in theory).

The recommended approach is for the Safety Plan to keep all required activities, since they were specified to ensure safety in the most cost-effective way, and to explain very clearly why they are required. Some adaptations may nevertheless prove useful, to make safety activities even more rigorous and/or to lower the level of stress while maintaining the effectiveness of the process (e.g. changes in the organisation of work).

 

All members of the teams involved in development and safety assessment must receive appropriate briefing and training about the safety policy, including those not present at the beginning of the project.

3.5.5 Identification of hazards, risk assessment and specification of mitigation solutions

 

The objective is to identify possible hazards and associated risks, and possible causes of these hazardous conditions, in order to produce a safe system.

 

This includes:

·      Choice or definition of a Risk Classification Scheme, including:

¨    severity categories, with precise definitions

¨    classes of likelihood

¨    classes of risk tolerability

This should be done very early in the process, and should be based on a standard scheme, after checking that this standard scheme is suited to the system and context considered.

This is not related to the automated system itself, but the resulting scheme is a necessary input to the following activities.

·      Identification of hazards and failure modes; several methods should be used concurrently to get a first list as complete as possible:

¨    meetings with experts and experienced people, with the help of a structured method (e.g. structured brainstorming, for getting the results of brainstorming in a structured framework)

¨    use of lists already available for similar systems, and from reference books, and from actual historical accident and incident logs,

¨    FHA (Functional Hazard Analysis), at a level depending on the system complexity. Using this method at a very low level for a very complex system is not practicable.

¨    Techniques such as FMECA (Failure Mode, Effects and Criticality Analysis). Their scope of application depends on the system complexity. For complex systems such as a complete ATC system, it can only be done, in practice, at a very high level (high-level components of the system, and communication between these components).

Once this list of hazards is available, it should be reviewed (for completeness, relevance, consistency, etc.). The "phased inspections" technique is favoured for this review.

·      Assignment of a severity to each hazard: this depends on the possible consequences of each hazard, and must follow the Risk Classification Scheme. This should be done through meetings with experts and experienced people, with the help of a structured method.

·      Estimation of hazard likelihood: this evaluates how often the hazard could occur (frequency of occurrence per hour and over the projected lifetime of the system). This evaluation should be based on the study of initiating events, contributing factors, and the probability of failure of the features intended to remove this hazard. Favoured techniques are:

¨    established techniques, such as Fault Tree Analysis, remaining at a rather high level, in the case of complex systems;

¨    stochastic techniques when feasible; this requires availability of experienced specialists, necessary data and models, etc.; for example, refer to ARIBA WP4 report and to Fota’s and Blom’s papers (see references).

·      Risk assessment; this combines hazard severity and hazard likelihood to estimate the risk produced by each hazard, and to compare this estimate with previously defined thresholds of acceptability and tolerability. This is the goal of practices such as PSSA (preliminary system safety assessment).

·      Risk reduction; this activity defines the means to be used, adaptations to be done to the design, etc. in order to ensure that the system produced is at least tolerably safe. This may be through:

¨    removal of all unacceptable hazards, where practicable;

¨    mitigation of the remaining hazards to an acceptable level.

This may require:

¨    re-specification

¨    re-design

¨    incorporation of safety features

¨    incorporation of warning devices

¨    new operating and training procedures.

Techniques to be used depend very much on the problem to be solved. Techniques aimed at improving reliability (e.g. replication of critical components) should very often be considered.

Hazard likelihood and risk assessment should be updated once solutions have been defined. Where the risk cannot be reduced to an acceptable level, the activity should be stopped.
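The likelihood estimation and risk assessment steps can be illustrated together in Python: a small fault tree is evaluated with AND/OR gates, and the resulting per-hour probability is combined with a severity through a Risk Classification Scheme. This is a simplified sketch: the gate formulas assume independent basic events (which common-cause failures can violate) and combine per-hour probabilities directly, and the scheme itself (class names, thresholds and the rank-sum tolerability rule) is a placeholder, not taken from any standard.

```python
from __future__ import annotations

import math
from dataclasses import dataclass
from typing import Union


@dataclass
class BasicEvent:
    name: str
    p: float  # probability of occurrence per operating hour


@dataclass
class Gate:
    kind: str  # "AND" or "OR"
    inputs: list[Node]


Node = Union[BasicEvent, Gate]


def probability(node: Node) -> float:
    """Probability of a fault tree event, assuming independent basic events."""
    if isinstance(node, BasicEvent):
        return node.p
    ps = [probability(child) for child in node.inputs]
    if node.kind == "AND":
        return math.prod(ps)                          # all inputs must occur
    if node.kind == "OR":
        return 1.0 - math.prod(1.0 - p for p in ps)   # at least one input occurs
    raise ValueError(f"unknown gate kind: {node.kind}")


# Placeholder Risk Classification Scheme (illustrative, not from any standard).
LIKELIHOOD_THRESHOLDS = [1e-3, 1e-5, 1e-7, 1e-9]  # per hour, decreasing
LIKELIHOOD_CLASSES = ["frequent", "probable", "remote",
                      "extremely remote", "extremely improbable"]
SEVERITIES = ["catastrophic", "hazardous", "major", "minor"]  # most severe first


def likelihood_rank(p: float) -> int:
    """Index into LIKELIHOOD_CLASSES: one step down per threshold p falls below."""
    return sum(p < t for t in LIKELIHOOD_THRESHOLDS)


def tolerability(severity: str, p: float) -> str:
    """Combine severity and likelihood ranks (0 = worst) into a tolerability class."""
    score = SEVERITIES.index(severity) + likelihood_rank(p)
    if score <= 2:
        return "unacceptable"
    if score == 3:
        return "tolerable"
    return "acceptable"


if __name__ == "__main__":
    loss_of_alert = Gate("OR", [
        Gate("AND", [BasicEvent("primary server fails", 1e-4),
                     BasicEvent("standby server fails", 1e-4)]),
        BasicEvent("common software fault", 1e-7),
    ])
    p = probability(loss_of_alert)
    print(f"{p:.2e}", LIKELIHOOD_CLASSES[likelihood_rank(p)],
          tolerability("hazardous", p))
    # 1.10e-07 remote tolerable
```

In this example the software fault dominates the result once the servers are replicated, which is the kind of insight that guides the choice of further risk reduction measures.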

3.5.6 Monitoring and tracking of hazards and safety issues

All identified hazards, their characteristics, the source of their identification, the solutions chosen to address the safety risks they raise and, generally speaking, everything concerning safety-related issues should be recorded in a safety log, updated with new information throughout the life cycle.
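A safety log can be kept with any suitable tool; the essential properties are that entries are only ever added, and that the full history of each hazard can be retrieved. The Python sketch below illustrates these properties; the `SafetyLog` class, the entry kinds and the example hazards are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class LogEntry:
    timestamp: datetime
    hazard_id: str
    kind: str      # e.g. "identified", "severity assigned", "mitigation", "closed"
    detail: str
    source: str    # activity or person that produced the information


class SafetyLog:
    """Append-only safety log: entries are added, never modified or deleted."""

    def __init__(self) -> None:
        self._entries: list[LogEntry] = []

    def record(self, hazard_id: str, kind: str, detail: str, source: str,
               when: Optional[datetime] = None) -> LogEntry:
        entry = LogEntry(when or datetime.now(timezone.utc),
                         hazard_id, kind, detail, source)
        self._entries.append(entry)
        return entry

    def history(self, hazard_id: str) -> list[LogEntry]:
        """All entries for one hazard, in the order they were recorded."""
        return [e for e in self._entries if e.hazard_id == hazard_id]

    def open_hazards(self) -> set[str]:
        """Hazards that have been recorded but not yet closed."""
        seen = {e.hazard_id for e in self._entries}
        closed = {e.hazard_id for e in self._entries if e.kind == "closed"}
        return seen - closed


if __name__ == "__main__":
    log = SafetyLog()
    log.record("HZ-1", "identified", "loss of conflict alert", "FHA session")
    log.record("HZ-2", "identified", "stale flight plan display", "brainstorming")
    log.record("HZ-1", "mitigation", "hot-standby server added", "design review")
    log.record("HZ-1", "closed", "mitigation verified by test", "verification")
    print(sorted(log.open_hazards()))     # ['HZ-2']
    print(len(log.history("HZ-1")))       # 3
```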

3.5.7 Verification that the system complies with safety requirements

This is part of the normal life cycle, which is not specific to safety (testing and reviews).

However:

·      In the case of safety-related requirements, results of tests and reviews should be specifically reviewed and checked.

·      Techniques used in other domains for System Safety Assessment may be used when practicable.

·      When mitigating features have been added, their efficacy should be specifically tested.

3.5.8 Safety-related support during installation, commissioning, overall validation, transition, operation

Safety may be impaired by the way a system is installed, operated or maintained. A support activity is often required so that the (safe) system is used in a safe way.

It is especially important that system developers provide all the information about the automated system that the operator of the system needs in order to organise operations in a safe way.

 

Of course, whenever it is intended to use the system in a different manner, or when changes are to be incorporated, additional safety analysis is essential.

