Functional Safety Hazard Analysis and Risk Assessment for Self Driving Cars
Introduction to HARA Exposure, Hazards, ASIL Levels and Safety Goals
Hello 😃 In this Medium article, we are going to refine how we analyze hazards and assess risks. Before we start, we need to define which part of the vehicle will be under consideration. We are going to use an Advanced Driver Assistance System (ADAS) as an example.
Once we have chosen a system to investigate, we need to brainstorm and come up with different driving scenarios, like driving in the snow on a highway or being parked in a parking lot. We call this situational analysis. Separately, we’ll think about different ways that the vehicle can malfunction: we’ll brainstorm again to identify what could go wrong with the ADAS. We will then evaluate the malfunctions against the various driving scenarios and calculate a risk factor called ASIL, the Automotive Safety Integrity Level.
The ultimate goal of hazard analysis and risk assessment is to define requirements specifying what the ADAS needs to do in order to avoid hazardous situations.
Introduction to HARA
HARA stands for Hazard Analysis and Risk Assessment. In the previous Medium articles mentioned below, we already saw the basic ideas behind identifying hazards and assessing risks. ISO 26262 formalizes this process as hazard analysis and risk assessment.
Remember that the ultimate goal of functional safety is to reduce risks to levels acceptable to society. In hazard analysis and risk assessment, we figure out what could go wrong with the item, that is, how it could behave in unintended ways. We call this unintended behavior a malfunction. Then we measure the risk associated with that malfunction and identify what the item needs to do to remain safe. For example, if a lane keeping system twists the steering wheel too hard, the driver could lose control of the vehicle. There is a potential for a vehicle accident, so excessive torque would be a malfunction that leads to a hazardous situation. The hazard analysis and risk assessment contains five parts:
- Situational analysis, where you identify driving conditions, like city driving in rain.
- Hazard identification, where we analyze what goes wrong with the item under consideration.
- Classification of the situation and hazard according to severity and probability of occurrence.
- Calculating the ASIL.
- Deriving safety goals that identify what the system will do to lower risk.
HARA Identification of Hazards
After the situational analysis, the next step is to identify hazards. Remember that in ISO 26262, hazardous situations arise from malfunctions of electrical and electronic systems that could injure people or damage their health. Malfunctions are unintended behavior. If the purpose of the lane departure warning function is to vibrate the steering wheel, a malfunction could be that the vibration is too strong. The hazard is that the driver could lose control of the steering wheel, which could result in a collision with another vehicle. As with the situational analysis, we have a list of guide words to help identify malfunctions. We also have a list of potential accidents or hazardous events that could occur. Notice that we are not yet concerned with the technical implementation of the item.
For instance, we are not yet thinking about which part of the lane departure warning function is causing the excessive vibration. It could be a software bug or an issue with the vibration motor, but we are not concerned with that level of detail yet. We also assume that every other item in the vehicle is functioning correctly; we’re only concerned with the lane assistance item. A complete hazard identification would include the function definition, the malfunction under consideration (chosen from the guide words), the details of the malfunction, the accident (called the hazardous event), details about the event, and a summary description of the hazardous event.
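To make the shape of one hazard identification entry concrete, here is a minimal sketch in Python. The class name, field names and the guide word value are hypothetical and chosen only for illustration; the other example values come from the lane departure warning discussion above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HazardEntry:
    """One row of a hazard identification worksheet (names are illustrative)."""
    function: str         # what the item is supposed to do
    guide_word: str       # deviation chosen from the guide-word list
    malfunction: str      # details of the unintended behavior
    hazardous_event: str  # the accident that could result
    event_details: str    # details about the event
    summary: str          # summary description of the hazardous event


ldw_entry = HazardEntry(
    function="Lane departure warning vibrates the steering wheel",
    guide_word="too much",  # assumed guide word, not taken from the standard
    malfunction="Steering wheel vibration is too strong",
    hazardous_event="Collision with another vehicle",
    event_details="Driver could lose control of the steering wheel",
    summary="Excessive vibration could make the driver lose control and collide",
)
```

Keeping the entry free of implementation details (no mention of a motor or a software bug) mirrors the point that hazard identification happens before we look at the technical design.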
HARA Risk Assessment, Severity and Exposure
It’s time to combine situations and hazards and then evaluate risks. Once we know how high our risks are, we can figure out how to bring them down to acceptable levels. In the previous Medium articles referenced above, we talked about evaluating risk with the equation risk equals severity times probability of occurrence. In those articles, we simplified things a bit.
In the ISO 26262 standard, risk is defined slightly differently: risk equals severity of a malfunction times probability of loss due to a malfunction.
Probability of loss due to a malfunction takes into account two terms called exposure and controllability. Risk then equals severity times exposure times controllability. Let’s discuss each of these factors in turn. We introduced severity earlier.
Severity measures how badly a person could get injured in an accident. In the ISO 26262 standard, severity has four levels, labeled S0, S1, S2 and S3. S0 represents no injuries whereas S3 implies life threatening or fatal injuries as indicated in the image below for reference.
For example, a situation in which the vehicle travels at over 40 kilometers per hour would have a severity of S3. In the situational analysis for the lane departure warning, we considered that the driver was traveling at high speed, so the severity would be S3. We can use the table below as a guide for evaluating the severity of the situation under consideration.
Now, let’s talk about exposure. Do you remember the definition of probability of occurrence from the introductory articles? Probability of occurrence measures how often or how long drivers find themselves in a specific situation, for example driving on a freeway, being parked in a parking lot, or driving on a wet road. Exposure is defined exactly the way we defined probability of occurrence in the introductory articles. Exposure has a scale from E0 to E4. For our lane departure warning example, the situation involved highway driving on wet roads. According to the functional safety standard, driving on wet roads is E3.
The final term for evaluating risk is controllability. Controllability measures how likely it is that the driver will be able to keep or regain control of the vehicle during a hazardous event. Controllability has a scale from C0 to C3, where C3 is a situation in which an average driver could not maintain control of the vehicle. If the lane departure warning function causes the steering wheel to vibrate excessively, with wild swings of the steering wheel, most drivers would have difficulty controlling the vehicle. We will put the controllability at C3.
We are also providing a table to help distinguish between different controllability levels.
We can now evaluate the risk of our lane departure warning example. We will combine severity, exposure and controllability into a risk factor called ASIL. Next, we will discuss ASIL in more depth.
HARA ASIL Levels
Severity, exposure and controllability provide a systematic way of evaluating risk. We take these three factors and combine them into a metric called Automotive Safety Integrity Level, or ASIL. ASIL measures the amount of risk of each hazardous situation. The lowest risks are classified as ASIL A, whereas the highest risks are classified as ASIL D. ISO 26262 provides a table for combining the three terms into an ASIL.
Let’s calculate the ASIL for our lane departure warning example. We can see in the table that combining S3, E3 and C3 leads to ASIL C. So excessive steering wheel vibration is a hazard with relatively high risk. This table brings up a couple of points that we haven’t covered yet. What is QM?
And what happens if a hazard has a severity, exposure or controllability level of S0, E0 or C0?
If the risk assessment contains an S0, E0 or C0, then the risk is automatically marked as QM. QM stands for quality management and implies that the risk is already below acceptable levels. There is no need to apply ISO 26262 to a hazardous situation of QM because risk is already low enough. However, there are other standards that apply including the automotive quality management standard IATF 16949. Now that we know how to evaluate risks, we will discuss what the system needs to do in order to lower the risk.
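The table lookup described above can be sketched in a few lines of Python. This is not the table itself but a commonly used shortcut that reproduces it: add up the three class numbers and map the sum to an ASIL. The function name `determine_asil` is hypothetical, and the sketch assumes classes are passed as plain integers (3 for S3, 4 for E4, and so on).

```python
def determine_asil(severity: int, exposure: int, controllability: int) -> str:
    """Return 'QM', 'A', 'B', 'C' or 'D' for classes S0-S3, E0-E4, C0-C3."""
    if not (0 <= severity <= 3 and 0 <= exposure <= 4 and 0 <= controllability <= 3):
        raise ValueError("class out of range")
    # Any S0, E0 or C0 means the risk is handled by quality management.
    if 0 in (severity, exposure, controllability):
        return "QM"
    # Shortcut: the sum of the class numbers ranges from 3 to 10,
    # and only sums of 7 or more receive an ASIL.
    total = severity + exposure + controllability
    return {7: "A", 8: "B", 9: "C", 10: "D"}.get(total, "QM")


# Lane departure warning example from the text: S3, E3, C3
print(determine_asil(3, 3, 3))
```

For the lane departure warning example the sum is 9, which maps to ASIL C, in line with the table lookup above. Always check a result against the table in the standard itself before relying on it.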
HARA Safety Goals
The last step is to derive safety goals, which brings us back to requirements engineering. A safety goal defines what the item does to avoid hazardous situations. So a safety goal is a type of requirement. For the lane departure warning function, the hazardous event description was that the lane departure warning function applies too high an oscillating torque to the steering wheel, which could lead to a crash. What would the lane departure warning function need to do to avoid the situation?
Limit the amount of vibration. So the safety goal would be: the oscillating steering torque from the lane departure warning function shall be limited. This safety goal carries the ASIL C of the hazardous event. Limiting the vibrational steering torque would help lower the risk to an acceptable level.
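A safety goal is a requirement, not an implementation, but it can help to picture what "limited" might eventually look like in software. The sketch below simply clamps a signed torque request. The function name and the limit value are hypothetical; a real limit would come from vehicle testing and later requirement refinement.

```python
MAX_LDW_TORQUE_NM = 3.0  # hypothetical limit, for illustration only


def limit_ldw_torque(requested_nm: float, limit_nm: float = MAX_LDW_TORQUE_NM) -> float:
    """Clamp a signed oscillating torque request to the range +/- limit_nm."""
    return max(-limit_nm, min(limit_nm, requested_nm))
```

With this sketch, a request of 5.0 Nm would be cut down to 3.0 Nm, while a request of 1.2 Nm would pass through unchanged.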
Summary
We now have the tools to carry out hazard analysis and risk assessment. Let’s summarize what we have learned. First, we decided which vehicle item was under consideration. In this case we were looking at a lane assistance item. Then we identified various driving situations. The next step was to imagine lane assistance malfunctions that could lead to hazardous situations. We assessed risk based on severity, exposure, and controllability. The risk assessment produced an ASIL for each of these hazardous situations. In the final step, we defined what the item needs to do in order to remain safe. These are called safety goals.
Functional Safety at the Software and Hardware Levels for the Self Driving Cars
Introduction to Robustness, Quality, Freedom from Spatial Interference, Interference Communication and System Architecture Safety Design Patterns
Hello and welcome to this Medium article 😃 ISO 26262 sees functional safety as a systems engineering problem. Hardware and software are just two parts of a vehicle system. The functional safety standard puts hardware and software development at the bottom of the V diagram for a reason: it helps ensure that we develop our safety requirements while taking the vehicle as a whole into account. Otherwise we could miss potential hazards. In this article, we are going to give a broad overview of hardware and software development.
Hardware Failure Metrics
Random hardware failures will happen. There’s no way around it. For example, silicon can wear out or connections can break due to thermal expansion. Cosmic radiation and magnetic fields can also cause hardware to fail. The number of failures permitted within a given period of time depends on the ASIL. For example, an ASIL D element should have fewer than one failure every 100 million hours. Besides these hardware failure target values, the functional safety standard defines a few random hardware metrics called architectural safety metrics. These metrics are the single point fault metric, the latent fault metric and the probabilistic metric for random hardware failures. Each ASIL has different allowable thresholds for these three metrics.
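Failure rates like the one above are often quoted in FIT (failures in time), where 1 FIT means one failure per billion device hours. As a small worked example, the sketch below converts "one failure every 100 million hours" into FIT. The function name is hypothetical.

```python
HOURS_PER_FIT_WINDOW = 1e9  # 1 FIT = one failure per 10^9 device hours


def fit_from_hours_per_failure(hours_per_failure: float) -> float:
    """Convert 'one failure every N hours' into a failure rate in FIT."""
    return HOURS_PER_FIT_WINDOW / hours_per_failure


# ASIL D example from the text: fewer than one failure every 100 million hours
asil_d_limit_fit = fit_from_hours_per_failure(100_000_000)
print(asil_d_limit_fit)
```

So the ASIL D example corresponds to a target of fewer than 10 FIT, or equivalently fewer than 10^-8 failures per hour.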
Programming Languages
One of the first considerations in software development is choosing a programming language or modeling framework. Some are more suitable for functional safety than others. C, C++, and MATLAB/Simulink are all good choices. They share certain characteristics that make them appropriate for functional safety applications. First, they all have an unambiguous definition of syntax and semantics. Second, they all run on real time operating systems. Third, they all support runtime error handling. Finally, they support modularity, abstraction, and object oriented design. Functional safety also requires the use of a software development guideline. Two of the most common guidelines are MISRA C and MISRA C++.
MISRA stands for Motor Industry Software Reliability Association. These guidelines define a subset of the C and C++ programming languages that are appropriate for safety critical applications. They discuss defensive implementation techniques, language subsets, style guides and naming conventions.
What is MATLAB?
In the 60s and 70s, solving problems with computers required writing lots of code. Professor Cleve Moler wanted his students to focus on solving math and engineering problems. So, he created MATLAB using syntax that mirrors common science and engineering notation. Today, students use MATLAB in many disciplines such as robotics, machine learning, image processing, and biology, just to name a few.
But MATLAB is not used just by students. Engineers and scientists use MATLAB to develop automated driving systems that make our roads safer, create new medical devices that improve care for millions of people, change the way we live through robotics, and explore the mysteries of the solar system.
Software Safety Requirements Architecture Testing & Integration
From the standpoint of functional safety, software development involves four main tasks. The first task is developing software safety requirements. The second task is to specify a software architecture. Then, the third task is to test the software to make sure that the architecture meets the requirements. Lastly, software is integrated with hardware.
These four tasks might look familiar. They are the same four steps we’ve discussed previously for all levels of the V-diagram in the previous medium articles referenced below at the end of this medium article.
Where do software safety requirements come from? Many software safety requirements are derived directly from the technical safety requirements. In general, these requirements cover functions that enable the system to maintain or reach a safe state, as well as functions for detecting, indicating, and handling faults in both hardware and software. Typical software safety requirements might be based on timing constraints such as fault tolerant time intervals. Warning light functionality is also often associated with software safety requirements. Communication interfaces like CAN, Ethernet connections, or user interfaces are also commonly covered by software safety requirements. Keep in mind that there is no such thing as a separate safety software architecture: the software safety requirements are allocated to the overall product architectural design. Besides technical safety requirements, there are a few other sources of software safety requirements that we will talk about in the next section.
Software Safety Robustness and Quality
Flaws in the requirements are one of the major causes of accidents. The standard provides many recommendations about how to develop safety critical software. For example, besides technical safety requirements, the standard discusses deriving software safety requirements from two core principles. The first principle is ensuring robustness and quality. The second principle is freedom from interference. We will talk first about ensuring robustness and quality. Robustness refers to how the software behaves in the face of invalid inputs or stressful environmental conditions.
An invalid input for example could be a parameter that is outside its allowed range. Quality means that the software meets its functional requirements as well as its nonfunctional requirements, like maintainability, adaptability, usability, and performance. So far, we have seen how technical safety requirements, as well as quality and robustness, lead to software safety requirements. Next, we will talk about software safety requirements for freedom from interference. Interference between elements can happen at any level in the V model. Because software elements cannot always be physically separated, software interference is an especially important topic.
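Coming back to robustness for a moment: the idea of rejecting a parameter that is outside its allowed range can be sketched as a small defensive check. The function name and the choice to raise an exception are assumptions made for this illustration; an embedded system would more likely report a fault code and fall back to a safe default.

```python
import math


def check_range(name: str, value: float, low: float, high: float) -> float:
    """Return value if it is a finite number inside [low, high], else raise."""
    if not math.isfinite(value) or not (low <= value <= high):
        raise ValueError(f"{name}={value!r} outside allowed range [{low}, {high}]")
    return value
```

A caller could, for example, run every torque request through `check_range("torque_nm", request, -3.0, 3.0)` before using it, so that NaN values and out-of-range numbers are caught at the interface.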
Freedom from interference means that one software element should not cause a failure in another software element. Hence, software is partitioned into separate pieces so that failures do not spread. ISO 26262 does not mandate freedom from interference in cases where the software elements have the same ASIL. However, we need to demonstrate freedom from interference when software components with different ASIL ratings communicate with each other or run on the same ECU. To ensure software elements do not interfere with each other, we need to understand three types of interference. These are spatial, temporal, and communication interference.
Freedom from Spatial interference
Freedom from spatial interference means one software partition should not change the code or data of another software partition. In other words, memory and storage should be separated between software elements. Otherwise, code and data can become corrupted. In practical terms, we need to be mindful of read, write and execute permissions. Think about the relationship between software partitions as being one of trust: a higher ASIL element mistrusts any lower ASIL element. An ASIL D element could read from a QM element, but it should minimize doing so. An ASIL D element could also write to a QM element. The QM element, however, should only be able to read from the ASIL D element; a QM element should not write to the higher ASIL element. Likewise, a QM element should be able to execute functions provided by an ASIL D element, but an ASIL D element would not trust QM functions and would not execute them.
As a concrete example, what happens if a software bug in a QM element mistakenly creates a pointer address to an ASIL C partition?
The QM element might then write to an ASIL C element. This could lead to a safety goal violation, so we should either prevent or detect it. Next, let’s discuss temporal freedom from interference.
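The trust rules for spatial interference can be written down as a tiny policy model. This is only a sketch of the rules as described in this article: in a real ECU they would be enforced by hardware and the operating system (for example through memory protection), not by a Python function. The names `ASIL_RANK` and `is_allowed` are hypothetical.

```python
ASIL_RANK = {"QM": 0, "A": 1, "B": 2, "C": 3, "D": 4}


def is_allowed(actor: str, target: str, access: str) -> bool:
    """Decide whether an element of level `actor` may access an element of level `target`."""
    a, t = ASIL_RANK[actor], ASIL_RANK[target]
    if access == "read":
        # Reading is possible in both directions, although a higher
        # ASIL element should minimize reading from a lower one.
        return True
    if access == "write":
        # Only an equal or higher ASIL element may write to the target.
        return a >= t
    if access == "execute":
        # An element only executes functions of equal or higher ASIL.
        return a <= t
    raise ValueError(f"unknown access type: {access}")


# The pointer bug example: a QM element trying to write into an ASIL C partition
print(is_allowed("QM", "C", "write"))
```

In the pointer bug example, the write from QM into the ASIL C partition is the access that this policy refuses, which is exactly the case we want to prevent or detect.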
Freedom from Temporal interference
Temporal interference refers to one element blocking the execution of another element over time. For example, if two software elements share data, a higher priority thread could continuously get access to the data and the low priority thread would always be waiting. This is called blocking of execution.
There are many other cases of temporal interference. Deadlocks occur when two executing threads each need a resource that the other one holds: thread one needs resource B but holds resource A, while thread two needs resource A but holds resource B. A similar kind of interference is called a livelock, where two threads want the same resource. In this case, both threads have the courtesy to let the other thread go first, but they keep stepping aside and then trying to grab the resource simultaneously.
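One common way to avoid the deadlock described above is to make every thread acquire shared resources in the same fixed order. This technique is not taken from the standard; it is just a familiar programming practice, sketched here with Python threads and hypothetical names.

```python
import threading

lock_a = threading.Lock()  # stands in for resource A
lock_b = threading.Lock()  # stands in for resource B
completed = []


def worker(name: str) -> None:
    # Both threads take A first and B second, so neither thread can
    # end up holding B while waiting for A.
    with lock_a:
        with lock_b:
            completed.append(name)


threads = [threading.Thread(target=worker, args=(n,)) for n in ("thread one", "thread two")]
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=5)
```

If thread two took B first and A second instead, the circular wait from the example above would become possible again.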
Another example of temporal interference is incorrect synchronization between software elements. For example, consider an autonomous vehicle that has both a radar system and a camera system for vehicle detection. Each system will have its own ECU. But what if the clocks on the ECUs are not synchronized properly? Then sensor fusion could not be done, since there would be no way to compare signals over time. Hence, clocks need to be synchronized, or all software elements need to use the same master clock. When any timing or execution related fault occurs, safety mechanisms need to take action. This can be degraded functionality and an eventual transition to a safe state to avoid safety goal violations.
Now let’s discuss incorrect allocation of execution time and incorrect execution sequence. These types of temporal interference occur when software elements execute early, late, or out of order, or take too long to execute. The functional safety standard recommends three different mechanisms for dealing with execution time and execution sequence faults: alive supervision, deadline monitoring and control flow monitoring. All three of these mechanisms work with checkpoints, where a software element reports its status at the beginning and end of execution. Alive supervision limits the number of times a software element can execute within a given time span. When a checkpoint is reached, the system analyzes the number of times the element was executed. Deadline monitoring looks at how long it takes to execute a software element. If an element takes too long, an error has occurred. And lastly, control flow monitoring ensures that the software progresses in the correct order. For example, the execution sequence might be out of order, or a software element might get executed in the middle of the sequence instead of at the beginning. Separate software blocks would be included in the system architecture to analyze the checkpoints. If a fault occurs that could violate a safety goal, then the system should transition to a safe state.
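The three checkpoint-based mechanisms can be sketched as three small checks. This is a simplified illustration with hypothetical function names, and it assumes the checkpoint data (counts, timestamps and checkpoint names) has already been collected by some other part of the system.

```python
from typing import Sequence


def alive_ok(execution_count: int, min_count: int, max_count: int) -> bool:
    """Alive supervision: the element ran an acceptable number of times in the window."""
    return min_count <= execution_count <= max_count


def deadline_ok(start_s: float, end_s: float, max_duration_s: float) -> bool:
    """Deadline monitoring: time between start and end checkpoints is within budget."""
    return 0.0 <= end_s - start_s <= max_duration_s


def control_flow_ok(observed: Sequence[str], expected: Sequence[str]) -> bool:
    """Control flow monitoring: checkpoints were reached in the expected order."""
    return list(observed) == list(expected)
```

A supervising software block could evaluate these checks once per cycle and request a transition to a safe state as soon as any of them returns False.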
Freedom from Interference Communication
The last source of interference we will discuss is communication interference. In vehicle systems, software elements need to exchange information both within ECUs and between ECUs. Even if the sender and receiver have the same ASIL, there is a need to protect against interference. Communication channels between software elements often will have a lower ASIL or a QM rating. A protected communication channel will detect data transmission errors. One of the most common mechanisms for ensuring freedom from communication interference is called E2E, or end to end protection.
An E2E protocol can help protect against communication faults due to hardware, software, and electromagnetic interference. The protocol checks that data was not corrupted during transmission. This way, when there is data corruption that could violate the safety goal, the vehicle can be led to a safe state. In the lane departure warning example, we had a technical safety requirement that could be solved with an E2E protection mechanism. One requirement was that the validity and integrity of the data transmission of the LDW torque request signal shall be ensured.
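To give a feel for how end to end protection works, here is a heavily simplified sketch: the sender adds a message counter and a checksum, and the receiver verifies both. The frame layout, the function names and the use of CRC-32 from Python's `zlib` module are assumptions made for this example. Production E2E profiles define their own checksums, counters and data identifiers.

```python
import struct
import zlib


def e2e_protect(payload: bytes, counter: int) -> bytes:
    """Prefix the payload with a 1-byte counter and append a CRC-32 over both."""
    body = bytes([counter % 256]) + payload
    return body + struct.pack(">I", zlib.crc32(body))


def e2e_check(frame: bytes, expected_counter: int):
    """Return the payload if CRC and counter match, otherwise None."""
    if len(frame) < 5:
        return None  # too short to hold counter and CRC
    body, received_crc = frame[:-4], struct.unpack(">I", frame[-4:])[0]
    if zlib.crc32(body) != received_crc:
        return None  # data was corrupted in transit
    if body[0] != expected_counter % 256:
        return None  # message lost, repeated or out of order
    return body[1:]
```

The checksum covers data corruption, while the counter lets the receiver notice lost or repeated messages. When `e2e_check` returns None, the receiver knows not to trust the torque request and can start moving to a safe state.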
System Architecture Safety Design Patterns
Design patterns are proven, well established architectures. One of the most common software patterns in the automotive industry is called E-Gas. The E-Gas pattern defines three levels of software. Level one is the functional level, which contains the intended functionality of the system. For the lane assistance example used in previous Medium articles, the intended functionality would be to turn the wheel towards the center and vibrate the steering wheel based on camera data. Levels two and three of the E-Gas pattern are for monitoring. Level two refers to functional monitoring, which monitors level one for any safety goal violations. For example, level two software would monitor the lane assistance time and torque requests, as discussed in the previous articles.
Level three is for processor monitoring of major hardware components. Level three software would, for example, monitor RAM, ROM and control flow. In our lane assistance example, the level one software would send torque requests to the steering wheel motor. If the level two or three software detects an error, then the output from level one is disabled and the level two or three software leads the system to a safe state. In this case, we determined that the safe state would be a torque output of zero.
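The three levels can be pictured with a toy sketch. Everything here is hypothetical: the function names, the steering gain and the torque limit are made up for illustration, and the level three checks are reduced to boolean flags that a real system would obtain from hardware self tests.

```python
SAFE_TORQUE_NM = 0.0  # safe state from the text: torque output of zero
MAX_TORQUE_NM = 3.0   # hypothetical limit checked by level two


def level1_function(lane_offset_m: float) -> float:
    """Level 1: intended functionality (toy steering request towards center)."""
    return -2.0 * lane_offset_m  # hypothetical gain


def level2_monitor(torque_nm: float) -> bool:
    """Level 2: functional monitoring of the level 1 output."""
    return abs(torque_nm) <= MAX_TORQUE_NM


def level3_monitor(ram_ok: bool, rom_ok: bool, control_flow_ok: bool) -> bool:
    """Level 3: processor monitoring, reduced here to pre-computed flags."""
    return ram_ok and rom_ok and control_flow_ok


def torque_output(lane_offset_m: float, ram_ok: bool = True,
                  rom_ok: bool = True, control_flow_ok: bool = True) -> float:
    request = level1_function(lane_offset_m)
    if level2_monitor(request) and level3_monitor(ram_ok, rom_ok, control_flow_ok):
        return request
    # A monitor detected an error: disable level 1 and go to the safe state.
    return SAFE_TORQUE_NM
```

The important design choice is that level one never reaches the actuator directly: its output only passes through when both monitoring levels agree that everything is fine.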
With this, we have come to the end of this article. Thanks for reading and following along. Hope you loved it!