Example of self-healing scenario
Overview
For their daily operation, the devices of a public company must continuously run an enterprise application called MustRun. However, this application is not very stable and crashes intermittently. When mustrun.exe crashes, it is unable to shut down properly. As a result, the application usually leaves one of its log files in an inconsistent state. The corrupted log file prevents MustRun from starting up again until the file is deleted.
As soon as an end user realizes that MustRun is no longer running on a device, the user must restart the application by first deleting the corrupted log file and then relaunching mustrun.exe. This procedure is inconvenient for end users and has a negative impact on their productivity. In addition, inexperienced end users who are not yet familiar with the problem are very likely to require support from the help desk to solve it.
In this example, learn how to leverage Nexthink Act to automatically detect the crash of mustrun.exe, delete the log file, and restart the application. Automating the process thus increases the productivity of end users and reduces the number of reported incidents.
Creating the remote action
To remediate the problem of the MustRun application with Nexthink Act, create and schedule a remote action. The remote action shall include:
An investigation that selects those devices on which MustRun crashed recently.
A PowerShell script that deletes the corrupted log file and restarts the application on the selected devices.
Defining the target investigation
Start by creating an investigation that detects the crashes of the executable file mustrun.exe and returns the devices on which they happened. Since the Engine can insert events that lie up to 30 minutes in the past, set the time frame of the investigation to span the last 30 minutes. Combined with an appropriate scheduling of the remote action, this time frame ensures that no reported application crash is missed. This is a very conservative choice. Because of the speed at which Collector reports application crash events, time frames of around 10 minutes should be equally valid.
Because Nexthink detects application crashes, the investigation in the remote action of this example is already able to return the specific devices on which the problem occurs. On the other hand, in cases where Nexthink does not retrieve the information needed to detect the issue by default (e.g. a change in the value of a registry key), check the faulty condition in the script of the remote action itself. When the script checks the occurrence of a problem, the associated investigation must target all potentially impacted devices.
Therefore, we can classify the problem detection mechanism of a remote action as either:
Investigation-based.
Script-based.
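As a rough illustration of script-based detection, the following sketch checks a registry value inside the script and reports whether remediation is needed. The registry path, value name, expected value, and the function name Test-NeedsRemediation are all hypothetical placeholders and not part of Nexthink; only the standard Get-ItemProperty cmdlet is assumed.

```powershell
# Hypothetical registry location and expected value; replace with your own.
$keyPath   = 'HKLM:\SOFTWARE\MustRun'
$valueName = 'Mode'
$expected  = 'Enabled'

# Returns $true when the value is missing or differs from the expected one.
function Test-NeedsRemediation {
    param(
        [string] $Path,
        [string] $Name,
        [string] $Expected
    )
    try {
        $actual = (Get-ItemProperty -Path $Path -Name $Name -ErrorAction Stop).$Name
    } catch {
        # The key or the value cannot be read: treat this as the faulty condition
        return $true
    }
    return ($actual -ne $Expected)
}

if (Test-NeedsRemediation -Path $keyPath -Name $valueName -Expected $expected) {
    # The remediation steps would go here
    $result = 'Faulty condition detected'
} else {
    $result = 'No issue detected'
}
```

Because the script itself decides whether anything is wrong, the investigation associated with such a remote action would have to return every potentially impacted device rather than only the faulty ones.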
Scheduling the remote action
After saving the investigation, create a remote action that periodically evaluates the investigation and runs the remediation script on the selected devices.
To properly schedule the remote action, configure the two periods:
Evaluation period
The time interval between two evaluations of the associated investigation. In our example, this value indicates how often the remote action checks for MustRun crashes. This period should be shorter than or equal to the time frame of the associated investigation so that no application crash event is missed. The shorter the evaluation period, the more responsive the remote action is, but the more load it puts on the system. To detect application crashes, an evaluation period of 10 minutes should be responsive enough. For critical applications, select an evaluation period as short as 1 minute.
Triggering period
The time interval between two consecutive triggerings of the remote action. For a remote action that detects issues by means of an investigation on events, such as application crashes, set the triggering period to be equal to the time frame of the investigation (30 minutes, in the example). This ensures that the execution of the script is not triggered more than once for the same event.
To associate the previously created investigation with the remote action, drag and drop the investigation onto the appropriate area of the editor of remote actions.
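The two scheduling rules can be expressed as a small consistency check. This is only an illustrative sketch: Test-RemoteActionSchedule and its parameters are hypothetical names, not a Nexthink API, and all values are assumed to be in minutes.

```powershell
# Illustrative helper, not part of Nexthink: checks the two scheduling rules.
function Test-RemoteActionSchedule {
    param(
        [int] $TimeFrameMinutes,
        [int] $EvaluationPeriodMinutes,
        [int] $TriggeringPeriodMinutes
    )
    $issues = @()
    # Rule 1: evaluate at least once per investigation time frame
    if ($EvaluationPeriodMinutes -gt $TimeFrameMinutes) {
        $issues += 'Evaluation period exceeds the time frame: crash events may be missed.'
    }
    # Rule 2: do not trigger again while the same event is still in the time frame
    if ($TriggeringPeriodMinutes -lt $TimeFrameMinutes) {
        $issues += 'Triggering period is shorter than the time frame: the script may run twice for the same event.'
    }
    # The leading comma keeps the result an array, even when it is empty
    return ,$issues
}

# Values from the example: 30-minute time frame, 10-minute evaluation, 30-minute triggering
$issues = Test-RemoteActionSchedule -TimeFrameMinutes 30 -EvaluationPeriodMinutes 10 -TriggeringPeriodMinutes 30
if ($issues.Count -eq 0) { 'Schedule is consistent' } else { $issues }
```

The check only flags a triggering period shorter than the time frame, because that is the case in which the same crash event could trigger the script twice.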
Adding the PowerShell script
Open your favorite text editor and type in the remediation script. Remember to encode the text of the script in UTF-8 with BOM when saving the file.
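If your editor does not offer an explicit UTF-8 with BOM option, one possible way to produce such a file is to write it from PowerShell with the .NET UTF8Encoding class. The file name and the placeholder content below are assumptions made for illustration only.

```powershell
# Hypothetical file name and placeholder content, for illustration only
$scriptPath = Join-Path ([System.IO.Path]::GetTempPath()) 'restart-mustrun.ps1'
$scriptText = '# remediation script goes here'

# Passing $true asks UTF8Encoding to emit the byte order mark (EF BB BF)
$utf8WithBom = New-Object System.Text.UTF8Encoding $true
[System.IO.File]::WriteAllText($scriptPath, $scriptText, $utf8WithBom)

# Check that the file starts with the three BOM bytes
$bytes = [System.IO.File]::ReadAllBytes($scriptPath)
$hasBom = ($bytes.Length -ge 3) -and
          ($bytes[0] -eq 0xEF) -and ($bytes[1] -eq 0xBB) -and ($bytes[2] -eq 0xBF)
```

The same byte check can be run against a file saved from a text editor to confirm that the encoding option was applied.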
The script does the following:
Add the Nexthink dynamic library that deals with remote actions (nxtremoteactions.dll) by means of the Add-Type cmdlet.
Initialize the result of the script to the empty string.
Initialize a couple of variables with:
The path of the executable mustrun.exe.
The path of the corrupted log file to delete.
Try to remove the log file with the Remove-Item cmdlet and set the result accordingly.
Restart the MustRun application with the Start-Process cmdlet.
Send the result to the Engine with the WriteOutputString function of the object NXT which was imported from the remote actions library.
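# Load the Nexthink library that deals with remote actions
Add-Type -Path "$env:NEXTHINK\RemoteActions\nxtremoteactions.dll"
[string] $result = ""
# The paths to the MustRun application and its log file
$mrexe = "$env:ProgramFiles\MustRun\mustrun.exe"
$logfile = "$env:ProgramFiles\MustRun\log.txt"
# Delete the log file if it is present
try {
    Remove-Item -Path $logfile -ErrorAction Stop
    $result = "The corrupted log file was deleted"
} catch [System.Management.Automation.ItemNotFoundException] {
    $result = "The log file does not exist"
} catch {
    # Any other failure, such as a locked file or missing permissions
    $result = "The log file could not be deleted: $($_.Exception.Message)"
}
# Restart the application
Start-Process -FilePath $mrexe
# Send the result to the Engine
[NXT]::WriteOutputString("Result", $result)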
For security reasons, Nexthink recommends that you sign your scripts. For testing purposes, however, you can use unsigned scripts, but only in pre-production environments.
In the editor of the remote actions, click Import... to link the script to the remote action. The Finder interprets the source of the script and lists the Result output under the Outputs section.
Adapt the previous script to your own use cases to benefit from the self-healing capabilities of Nexthink Act.