System Error Monitoring

System errors can be detected in technical logs that identify internal system events:

Sending data between Docker containers.
Interaction between mobile platform services.
Executing database operations via Docker containers, for example, opening, writing to, and closing a database.

For centralized logging of mobile platform components, use the docker-compose.metrics.yml file and the following preinstalled applications:

fluentd. The application collects technical logs in the Elasticsearch storage built into Foresight Mobile Platform.
Kibana. The application is used to view information about technical logs and export it to a *.csv file.

To detect system errors in technical logs:

If a cluster based on Kubernetes, Deckhouse, or OKD/OCP was deployed during installation of Foresight Mobile Platform, enable mobile platform component logging in the cluster.
Get the list of technical logs using one of the following methods:

Using the command:

kubectl logs -n <mobile platform server namespace> <pod name>

In the Kibana application. For details, see the Viewing Technical Logs subsection.

The resulting list of technical logs may contain errors.
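As a minimal sketch, the output of the kubectl logs command can be scanned for error records with a short script. The log line layout and the ERROR/FATAL level markers are assumptions made for illustration; the actual format of mobile platform technical logs may differ. The function names are hypothetical.

```python
import re
import subprocess

# Assumed level markers; adjust them to the actual technical log format.
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL)\b", re.IGNORECASE)


def find_error_lines(log_text):
    """Return the log lines that contain an assumed error-level marker."""
    return [line for line in log_text.splitlines() if ERROR_PATTERN.search(line)]


def fetch_pod_logs(namespace, pod_name):
    """Run `kubectl logs -n <namespace> <pod name>` and return its output."""
    result = subprocess.run(
        ["kubectl", "logs", "-n", namespace, pod_name],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if kubectl exits with an error
    )
    return result.stdout
```

With kubectl configured for the cluster, the two functions can be combined as find_error_lines(fetch_pod_logs(namespace, pod_name)), where both arguments are the values used in the command above.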

Error Classification

Errors are classified by the following attributes, which help determine the possible cause of an error:

Attribute	Description	Possible reason
Connection termination	After a request is sent to the mobile platform server, the 504 or 499 error is returned. The response containing the requested data is not returned.	Incorrect timeouts or insufficient CPU and RAM resources
Connection hanging	After a request is sent to the mobile platform server, the connection remains open and does not close for a long time. The response containing the requested data is not returned.	Data source is not available
Mobile platform server unavailability	Requests to the mobile platform server cannot be sent, either intermittently or permanently.	Mobile platform server is not available
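The classification above can be sketched as a small lookup function. The RequestOutcome record and its fields are hypothetical and only illustrate how observed symptoms could be mapped to the table rows; this is not part of the mobile platform API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class RequestOutcome:
    """Hypothetical record of what a client observed for one request."""
    sent: bool                  # False if the request could not be sent at all
    status_code: Optional[int]  # HTTP status, or None if no response arrived
    hung: bool = False          # True if the connection stayed open without a response


def classify(outcome: RequestOutcome) -> Optional[Tuple[str, str]]:
    """Map an observed outcome to (attribute, possible reason) from the table."""
    if not outcome.sent:
        return ("Mobile platform server unavailability",
                "Mobile platform server is not available")
    if outcome.status_code in (504, 499):
        return ("Connection termination",
                "Incorrect timeouts or insufficient CPU and RAM resources")
    if outcome.hung and outcome.status_code is None:
        return ("Connection hanging", "Data source is not available")
    return None  # the outcome does not match any row of the table
```

For example, classify(RequestOutcome(sent=True, status_code=504)) returns the Connection termination row, which points to timeouts and resource usage as the first things to check.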

Diagnosing and Eliminating Errors

To diagnose and eliminate errors, use the following applications:

Grafana. The application is used to analyze system resource usage (Step 2).
OKD. The application is used to analyze the operation of a cluster based on OKD/OCP (Steps 2, 3, 4, 6).
Kibana. The application is used to analyze statistical information about the system using visual data representation (Step 7).

The steps are ordered from simple to complex. If required, the order can be changed:

In the System Logs section, filter the system log by the Error status. If required, specify the date range and the event start and end times.
Check CPU and RAM usage by containers and cluster nodes. To do this, see the following subsections:

Viewing Volume of System Resources Usage by Containers in OKD. It is relevant for an OKD/OCP cluster.
Viewing Volume of System Resources Usage by Containers in Grafana. It is relevant for a Kubernetes, Deckhouse, or OKD/OCP cluster.
Viewing Volume of System Resources Usage by Cluster Nodes in Grafana. It is relevant for a Kubernetes, Deckhouse, or OKD/OCP cluster.

TIP. For optimal operation, CPU and RAM usage should not exceed 70%.

Check the operation of pods on each node of the OKD/OCP cluster. To do this, see the Checking Work of Pods on Each Cluster Node and Auditing Their Logs subsection in OKD.
If an OKD/OCP cluster is used, check access to the data source. To do this, see the Checking Access to Data Source subsection in OKD.
Check the timeouts on the proxy server and in the framework, and set timeouts:

On the proxy server in front of the cluster.
In the cluster (Ingress Controller).
In the data source.

NOTE. Timeouts should correspond to the actual request execution time.

Analyze events of the OKD/OCP cluster. To do this, see the Checking Cluster Events subsection in OKD.
Analyze statistical information about the system using visual data representation. To do this, see the Viewing Statistical Information about the System subsection in Kibana.
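The two numeric checks in the steps above, the 70% usage limit and the match between timeouts and the actual request execution time, can be sketched as follows. The function names, the input dictionaries, and the layer labels are hypothetical; the values would be read manually from Grafana, OKD, and the timeout settings of each layer.

```python
USAGE_LIMIT_PERCENT = 70.0  # limit taken from the TIP in the resource usage step


def overloaded(usage_percent):
    """Return the names whose CPU or RAM usage exceeds the 70% limit."""
    return sorted(
        name for name, value in usage_percent.items()
        if value > USAGE_LIMIT_PERCENT
    )


def short_timeouts(timeouts_sec, request_time_sec):
    """Return the layers whose timeout is shorter than the actual request time."""
    return [
        layer for layer, timeout in timeouts_sec.items()
        if timeout < request_time_sec
    ]


if __name__ == "__main__":
    # Illustrative values only; they are not measurements from a real system.
    print(overloaded({"node-1 CPU": 82.5, "node-1 RAM": 64.0}))
    print(short_timeouts({"proxy": 60, "ingress": 30, "data source": 120}, 45))
```

A layer reported by short_timeouts is a candidate cause of the 504 or 499 errors, because its timeout expires before the request can complete.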

Restoring System

The system should be restored when the following emergency situations occur:

Mobile platform server is unavailable. This situation is characterized by failures during API user authentication and unavailability of the administrator console in the browser. For details, see the Restoring Mobile Platform Server Availability section.
Environment is unavailable. This situation is characterized by failures during API user authentication and the absence of the previously added environment in the administrator console. For details, see the Restoring Environment Availability section.
Environment project is unavailable. This situation is characterized by failures during API user authentication and the absence of the previously added project in the administrator console. For details, see the Restoring Project Availability in Environment section.
Incorrect data caching work. This situation is characterized by outdated data, relative to the data source, displayed on user mobile devices; a disabled Update button when viewing saved caches by parameters; errors and system freezes during cache update; and an unchanged cache version after a successful update. For details, see the Restoring Data Caching Work section.
Incorrect interaction between mobile platform server and external LDAP system. This situation is characterized by failures during authentication of users from the LDAP directory. For details, see the Restoring Interaction Between Mobile Platform Server and External LDAP System section.
Incorrect work of system logs. This situation is characterized by the absence of the latest event records in the System Logs section. For details, see the Restoring Log with System Logs section.
Failures of the virtual machine used by the mobile platform server. This situation is characterized by failures during API user authentication and unavailability of the administrator console in the browser. For details, see the Restoring Virtual Machine Work section.
Incorrect administrator console work. This situation is characterized by no response to button clicks in the administrator console. For details, see the Restoring Administrator Console Work section.
Administrator console is unavailable. This situation is characterized by the 403 error on opening the administrator console via the HTTPS protocol. For details, see the Restoring Administrator Console Availability section.

NOTE. This situation is relevant only for Foresight Mobile Platform 23.12.

Database replica is unavailable. This situation is characterized by increased response time or no response from the database when executing read requests, and by the database pod state. For details, see the Restoring Database Replica Availability section. The example of restoring database replica availability is given for the OKD/OCP cluster.
Incorrect work of OKD/OCP cluster nodes after their restart. This situation is characterized by the number of pod containers and the allocated memory not being displayed, and by the error "/var/run/openvswitch/db.sock: connection attempt failed (No such file or directory)" on startup of the openvswitch service. The issue can be caused by an incorrect shutdown of a cluster node and an error in the disk subsystem. To resolve the issue, reinstall the cluster node. To prevent this behavior in the future, shut cluster nodes down according to the guide in the OKD application documentation.

To create a backup and restore the system, see the Creating a Backup and Restoring System section.