APEI and ACPI error injection

Resources

Introduction

This page outlines EINJ, an ACPI table mechanism that provides a generic interface through which the OS can inject hardware errors into the platform without requiring platform-specific kernel-level software. The primary goal of this mechanism is to support testing of the OS error handling stack by enabling the injection of hardware errors. With this capability, the OS can implement a simple interface for diagnosing and validating error handling on the system.


Hardware Error Reporting Mechanism

One error source that is not x86-specific and that a platform may report to the kernel is the generic hardware error source (GHES). How is GHES error reporting implemented in the Linux kernel?

GHES Design Documentation

A generic hardware error source is an error source that either notifies the OS of the presence of an error using a non-standard notification mechanism or reports error information that is encoded in a non-standard format. Using the information in a Generic Hardware Error Source structure, the kernel configures an error handler to read the error data from an error status block, a range of memory set aside by the platform for recording error status information.

The early phase of error reporting, up to the point where the GHES driver forwards the parsed data to the RAS daemon, is described in detail in the APEI Design Documentation and illustrated below.

GHES


Error Injection Mechanism

The Error Injection Table provides a generic interface through which the OS can inject hardware errors into the platform without requiring platform-specific OS software. System firmware is responsible for building this table, which is made up of Injection Instruction entries.

EINJ Design Documentation

EINJ provides a hardware error injection mechanism, which is useful for debugging and testing other APEI and RAS features.

EINJ Table

The kernel driver can perform hardware error injection as long as the vendor supplies an EINJ ACPI table that contains at least the mandatory injection actions:

Injection Action                                      Mandatory
BEGIN_INJECTION_OPERATION                             NO
GET_ERROR_TYPE                                        YES
SET_ERROR_TYPE and/or SET_ERROR_TYPE_WITH_ADDRESS     YES
EXECUTE_OPERATION                                     YES
CHECK_BUSY_STATUS                                     YES
GET_COMMAND_STATUS                                    YES
GET_TRIGGER_ERROR_ACTION_TABLE                        YES
TRIGGER_ERROR                                         YES
END_OPERATION                                         NO

An Injection Action consists of a series of one or more Injection Instructions. An Injection Instruction represents a primitive operation on an abstracted hardware register, represented by the register region defined in an Injection Instruction Entry. An Injection Instruction Entry describes a region in an injection hardware register and the injection instruction to be performed on that region. The table below lists the allowed primitive operations:

Injection Instruction
READ_REGISTER
READ_REGISTER_VALUE
WRITE_REGISTER
WRITE_REGISTER_VALUE
NOOP
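The primitive operations listed above can be illustrated with a toy interpreter that runs one instruction entry against a dictionary standing in for the abstracted hardware registers. This is a simplified model, not the kernel's implementation: the numeric opcodes are the values recalled from the ACPI specification and should be verified there, the mask handling is reduced to a plain bitwise AND, and InstructionEntry, run_entry and the "ctl" register are hypothetical names.

```python
from dataclasses import dataclass
from enum import IntEnum


class Instruction(IntEnum):
    # Opcode values as recalled from the ACPI specification (assumption).
    READ_REGISTER = 0
    READ_REGISTER_VALUE = 1
    WRITE_REGISTER = 2
    WRITE_REGISTER_VALUE = 3
    NOOP = 4


@dataclass
class InstructionEntry:
    instruction: Instruction
    register: str                      # hypothetical key into the register dict
    value: int = 0                     # used by the *_VALUE instructions
    mask: int = 0xFFFFFFFFFFFFFFFF     # bits of the register that take part


def run_entry(entry, registers, operand=0):
    """Execute one entry; 'operand' is the caller-supplied value for WRITE_REGISTER."""
    current = registers.get(entry.register, 0)
    if entry.instruction == Instruction.READ_REGISTER:
        return current & entry.mask
    if entry.instruction == Instruction.READ_REGISTER_VALUE:
        # Compare the masked register content with the entry's value field.
        return int((current & entry.mask) == entry.value)
    if entry.instruction == Instruction.WRITE_REGISTER:
        registers[entry.register] = operand & entry.mask
        return None
    if entry.instruction == Instruction.WRITE_REGISTER_VALUE:
        registers[entry.register] = entry.value & entry.mask
        return None
    return None  # NOOP: nothing to do


registers = {"ctl": 0}
run_entry(InstructionEntry(Instruction.WRITE_REGISTER, "ctl"), registers, operand=0x8)
readback = run_entry(InstructionEntry(Instruction.READ_REGISTER, "ctl"), registers)
```

A real entry addresses its register through a register region description and may need to preserve bits outside the mask; both details are left out to keep the sketch short.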

Injection Control Flow

Injection from user space into the Linux kernel is possible through the kernel's debugfs file system. The user is able to do several things:

  • list the errors that are available for injection on the given machine
  • set or get the specific error, chosen from the available ones, that is to be injected
  • initiate error injection
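The three user-space operations listed above could be wrapped in a few helper functions, sketched below. The directory and the file names (available_error_type, error_type, error_inject) are the ones the Linux EINJ driver normally exposes when debugfs is mounted at /sys/kernel/debug, but treat them, and the assumed "hex code, then description" line format, as things to confirm on the target kernel. The helper names and the base parameter are hypothetical; base exists so the functions can be pointed at a scratch directory for testing.

```python
from pathlib import Path

# Usual location of the EINJ debugfs files (assumption; needs root and a mounted debugfs).
EINJ_DIR = Path("/sys/kernel/debug/apei/einj")


def available_error_types(base=EINJ_DIR):
    """Return {code: description} parsed from lines like '0x00000008  Memory Correctable'."""
    types = {}
    for line in (Path(base) / "available_error_type").read_text().splitlines():
        parts = line.split(None, 1)
        if not parts:
            continue  # skip blank lines
        types[int(parts[0], 16)] = parts[1].strip() if len(parts) > 1 else ""
    return types


def get_error_type(base=EINJ_DIR):
    """Read the error type currently selected for injection."""
    return int((Path(base) / "error_type").read_text().strip(), 16)


def set_error_type(code, base=EINJ_DIR):
    """Select the error type to inject."""
    (Path(base) / "error_type").write_text(f"{code:#x}\n")


def initiate_injection(base=EINJ_DIR):
    """Ask the kernel to start the injection by writing to error_inject."""
    (Path(base) / "error_inject").write_text("1\n")
```

A typical session would call available_error_types(), pick one of the returned codes, pass it to set_error_type(), and then call initiate_injection(). On a real machine this injects an actual hardware error, so it belongs on test systems only.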

Once the user selects a particular error and initiates error injection, control passes to the kernel. Error injection is a two-step process in which the error is first injected into the platform and then triggered. After software injects an error into the platform using the SET_ERROR_TYPE action, it needs to trigger the error. To do so, the software invokes the GET_TRIGGER_ERROR_ACTION_TABLE action, which returns a pointer to a Trigger Error Action table. Software then executes the instruction entries specified in the Trigger Error Action table in order to trigger the injected error.

Before the kernel can use this mechanism to inject errors, it must discover the error injection capabilities of the platform by executing a GET_ERROR_TYPE action. After the capabilities have been discovered, the user can make the kernel inject and trigger an error according to the sequence described below.
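The capabilities reported by GET_ERROR_TYPE take the form of a bitmask, one bit per supported error type. The following sketch decodes such a mask into names. The bit assignments are reproduced from the ACPI specification's error type definition as recalled here and should be verified against the specification; vendor-defined types are left out, and decode_error_types is a hypothetical helper name.

```python
# Bit position -> error type name (assumed mapping; verify against the ACPI specification).
ERROR_TYPE_BITS = {
    0: "Processor Correctable",
    1: "Processor Uncorrectable non-fatal",
    2: "Processor Uncorrectable fatal",
    3: "Memory Correctable",
    4: "Memory Uncorrectable non-fatal",
    5: "Memory Uncorrectable fatal",
    6: "PCI Express Correctable",
    7: "PCI Express Uncorrectable non-fatal",
    8: "PCI Express Uncorrectable fatal",
    9: "Platform Correctable",
    10: "Platform Uncorrectable non-fatal",
    11: "Platform Uncorrectable fatal",
}


def decode_error_types(mask):
    """Return the names of all error types whose bit is set in 'mask', lowest bit first."""
    return [name for bit, name in sorted(ERROR_TYPE_BITS.items()) if mask & (1 << bit)]


# Hypothetical capability mask: memory correctable plus memory uncorrectable non-fatal.
supported = decode_error_types(0x18)
```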

NOTE: Injecting an error into the platform does not automatically consume the error. In response to an error injection, the platform returns a trigger error action table. The software that injected the error must execute the actions in the trigger error action table in order to consume the error. If a specific error type is automatically consumed on injection, the platform returns a trigger error action table consisting only of NOOP instructions.

Mandatory  Step description
NO         1. Executes a BEGIN_INJECTION_OPERATION action to notify the platform that an error injection operation is beginning.
YES        2. Executes a GET_ERROR_TYPE action to determine the error injection capabilities of the system.
YES        3. Kernel sets the type of error to inject (SET_ERROR_TYPE or SET_ERROR_TYPE_WITH_ADDRESS).
YES        4. Executes an EXECUTE_OPERATION action to instruct the platform to begin the injection operation.
YES        5. Busy waits by continually executing the CHECK_BUSY_STATUS action until the platform indicates that the operation is complete by clearing the abstracted Busy bit.
YES        6. Executes a GET_COMMAND_STATUS action to determine the status of the injection operation.
YES        7. If the status indicates that the platform cannot inject errors, stops.
YES        8. Executes a GET_TRIGGER_ERROR_ACTION_TABLE operation to get the physical pointer to the TRIGGER_ERROR action table. This provides flexibility on systems where injecting an error is a two (or more) step process.
YES        9. Executes the actions specified in the TRIGGER_ERROR action table.
NO         10. Executes an END_OPERATION action to notify the platform that the error injection operation is complete.
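The sequence of steps can be sketched as a small simulation that drives a stand-in platform object through the actions in order. FakePlatform and inject are hypothetical names that model only the ordering of the steps, not real firmware or the kernel driver; the command status value 0 for success is the encoding recalled from the ACPI specification and is an assumption to verify.

```python
COMMAND_SUCCESS = 0  # assumed encoding of a successful GET_COMMAND_STATUS result


class FakePlatform:
    """Hypothetical stand-in for firmware; records every action it receives."""

    def __init__(self, supported, status=COMMAND_SUCCESS, busy_polls=2):
        self.supported = supported      # bitmask returned by GET_ERROR_TYPE
        self.status = status            # value returned by GET_COMMAND_STATUS
        self.busy_polls = busy_polls    # number of polls that report "busy"
        self._busy_left = 0
        self.pending = None
        self.log = []

    def action(self, name):
        self.log.append(name)

    def check_busy_status(self):
        self.action("CHECK_BUSY_STATUS")
        busy = self._busy_left > 0
        if busy:
            self._busy_left -= 1
        return busy

    def execute_operation(self):
        self.action("EXECUTE_OPERATION")
        self._busy_left = self.busy_polls

    def get_trigger_error_action_table(self):
        self.action("GET_TRIGGER_ERROR_ACTION_TABLE")
        # The table is modelled as a list of callables to run in order.
        return [lambda: self.action("TRIGGER_ERROR")]


def inject(platform, error_type):
    """Walk through steps 1-10; return True if the error was injected and triggered."""
    platform.action("BEGIN_INJECTION_OPERATION")               # step 1 (optional)
    try:
        platform.action("GET_ERROR_TYPE")                      # step 2
        if not platform.supported & error_type:
            return False
        platform.action("SET_ERROR_TYPE")                      # step 3
        platform.pending = error_type
        platform.execute_operation()                           # step 4
        while platform.check_busy_status():                    # step 5
            pass
        platform.action("GET_COMMAND_STATUS")                  # step 6
        if platform.status != COMMAND_SUCCESS:                 # step 7
            return False
        for trigger in platform.get_trigger_error_action_table():  # step 8
            trigger()                                          # step 9
        return True
    finally:
        platform.action("END_OPERATION")                       # step 10 (optional)


platform = FakePlatform(supported=0x8)
ok = inject(platform, 0x8)
```

Placing END_OPERATION in a finally clause is a design choice of this sketch: it makes the closing notification happen on every exit path, including the early stop in step 7.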

EINJ

LEG/ServerArchitecture/RAS/EINJ (last modified 2017-08-17 12:13:07)