System Error Monitoring

In this article:

Error Classification

Diagnostic and Eliminating Errors

Restoring System

System errors can be detected in technical logs that identify internal system events:

For centralized logging of mobile platform components, use the docker-compose.metrics.yml file and the preinstalled applications:

To detect system errors in technical logs:

  1. If a fault-tolerant cluster based on OKD/OCP was created during installation of Foresight Mobile Platform, enable mobile platform component logging in the cluster.

  2. Get the list of technical logs using one of the methods:

kubectl logs -n <mobile platform server namespace> <pod name>

After these operations are executed, the retrieved technical logs may contain error records.
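The output of the kubectl command can be narrowed down to error records before it is read in full. The following is a minimal sketch, not a platform feature. It assumes that error records contain the word "error" or "fatal" in any letter case, which is an assumption about the log format. The function name filter_error_lines is hypothetical.

```shell
# Keep only the lines that look like error records.
# Assumption: error records contain "error" or "fatal" (case-insensitive).
# Adjust the pattern to the actual log format of the component.
filter_error_lines() {
    # grep exits with status 1 when nothing matches; "|| true" keeps
    # the function usable in scripts that run with "set -e".
    grep -iE 'error|fatal' || true
}

# Usage (the placeholders in angle brackets must be replaced by real values):
#   kubectl logs -n <mobile platform server namespace> <pod name> | filter_error_lines
```

An empty result only means that no line matched the assumed pattern. It does not prove that the component works without errors.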

Error Classification

Errors are classified by the following attributes, which help determine the possible reason for an error:

| Attribute | Description | Possible reason |
| --- | --- | --- |
| Connection termination | After a request is sent to the mobile platform server, error 504 or 499 is returned. The response containing the requested data is not returned. | Incorrect timeouts or insufficient CPU and RAM resources |
| Connection hanging | After a request is sent to the mobile platform server, the connection remains open and does not close for a long time. The response containing the requested data is not returned. | Data source is not available |
| Mobile platform server unavailability | A request cannot be sent to the mobile platform server, either intermittently or permanently. | Mobile platform server is not available |
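The classification above can be expressed as a small lookup, which may be convenient in monitoring scripts. This is an illustrative sketch only. The function name classify_symptom and the keywords "hang" and "unsent" are hypothetical; the status codes, attributes and possible reasons are taken from the table.

```shell
# classify_symptom SYMPTOM
# SYMPTOM is an HTTP status code or one of the hypothetical keywords:
#   hang   - the connection stays open and no response is returned
#   unsent - the request cannot be sent to the server
classify_symptom() {
    case "$1" in
        504|499) echo "Connection termination: check timeouts and CPU/RAM resources" ;;
        hang)    echo "Connection hanging: check data source availability" ;;
        unsent)  echo "Mobile platform server unavailability: check that the server is available" ;;
        *)       echo "Unclassified: not covered by the classification table" ;;
    esac
}

# Usage:
#   classify_symptom 504
```

The lookup only suggests where to start. The actual reason should be confirmed by the diagnostic steps.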

Diagnostic and Eliminating Errors

To diagnose and eliminate errors, use the following applications:

The actions are listed from simple to complex. If required, the order can be changed:

  1. In the System Logs section, filter the system records by the Error status. If required, specify the date range and the start and end time of events.

  2. Check CPU and RAM usage by containers and cluster nodes. To do this, see the subsections:

TIP. For optimal operation, CPU and RAM usage should not exceed 70%.

  3. Check the operation of pods on each cluster node. To do this, see the Checking Work of Pods on Each Cluster Node and Auditing Their Logs subsection in OKD.

  4. Check access to the data source. To do this, see the Checking Access to Data Source subsection in OKD.

  5. Check the timeouts on the proxy server and the framework, and set timeouts:

NOTE. Timeouts should correspond to the actual request execution time.

  6. Analyze cluster events. To do this, see the Checking Cluster Events subsection in OKD.

  7. Analyze statistical information about the system using visual data representation. To do this, see the Viewing Statistical Information about the System subsection in Kibana.
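The check of CPU and RAM usage on cluster nodes against the 70% limit from the tip can also be scripted. The sketch below assumes that a metrics server is available in the cluster, so that kubectl top node prints the columns NAME, CPU(cores), CPU%, MEMORY(bytes) and MEMORY%. The function name flag_busy_nodes is hypothetical, and the optional argument overrides the default limit.

```shell
# flag_busy_nodes [LIMIT]
# Reads "kubectl top node" style output from standard input and prints
# the nodes whose CPU% or MEMORY% is above LIMIT (70 by default).
flag_busy_nodes() {
    awk -v limit="${1:-70}" 'NR > 1 {
        # Column 3 is CPU%, column 5 is MEMORY%; strip the percent sign.
        cpu = $3; mem = $5
        sub(/%/, "", cpu); sub(/%/, "", mem)
        # Adding 0 forces a numeric comparison.
        if (cpu + 0 > limit + 0 || mem + 0 > limit + 0)
            printf "%s cpu=%s%% mem=%s%%\n", $1, cpu, mem
    }'
}

# Usage:
#   kubectl top node | flag_busy_nodes
#   kubectl top node | flag_busy_nodes 80
```

A node that reports exactly 70% is not flagged, because the tip treats only values above 70% as excessive. Container-level usage still needs to be checked as described in the corresponding subsections.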

Restoring System

The system should be restored when the following emergency situations occur:

To create a backup and restore the system, see the Creating a Backup and Restoring System section.

See also:

Administration and Access Control