Esxi Health Monitor: Host has bad drive, Monitor did not alert

This forum supports the ESX Host Health Monitor plugin. When posting, include screenshots of the issue and any script and command logs listed in the probe consoles.
TPriest@rocketit.com
Posts: 7
Joined: Tue Jul 17, 2018 12:53 pm

Esxi Health Monitor: Host has bad drive, Monitor did not alert

Post by TPriest@rocketit.com »

Good Morning everyone,


I had a client reach out to me yesterday to ask about a yellow light on their ESXi server. I checked the health monitor and it showed up-to-date data on the host, with no alarms. I logged into the iDRAC of the host and confirmed that a drive had failed. I would like to help the community by sending logs or whatever you may need to correct this problem. We went a month without knowing about this drive. There is a possibility that a driver is not being recognized; any suggestions? Thanks!

Cubert
Posts: 1483
Joined: Tue Dec 29, 2015 7:57 pm

Re: Esxi Health Monitor: Host has bad drive, Monitor did not alert

Post by Cubert »

I am assuming the drive is now fixed and green-lighted? If so, we won't be able to live-test the probe, but we can check whether the probe was running on the agent and what its output is. If you manually run the probe command directly from the agent's desktop, do you get good, clean returns?

The plugin schedules a script on each agent probe based on the interval you set in the plugin (anywhere from once a day to once an hour). The probe gets the script and executes it. The script runs 2 different commands at the agent's command line, reads back the returns, and places that data in the database. Then 3 internal monitors each execute a SQL query against the database looking for any recent failures and, if any are found, raise alarms. Once the failure is repaired and cleared, the SQL query goes green and the alarms stop.

You should edit the monitors and set them to alert you in the manner you desire.

To troubleshoot an issue, I start with the following test: viewtopic.php?f=31&t=5532

This verifies that the probe can actually function as a probe. If you get a large dump of data from CIM, then your probe is functional.

Next, run the probe from the plugin and allow it to complete. Once done, did the data refresh? If need be, truncate the plugin_sw_esx_healthmonitor_CIMdata table (check the SQL tables for the exact name) and run a probe, then see if data appears.

If that works, then test your monitors by looking at the results tab of each monitor. Does it error or have any issues with the data?

To test the monitors, you can also place a failure in the table and refresh the monitor to see whether it picks up the failure data.

If each step here is working correctly, then you should get notices within a very short time of any CIM failure being reported.


Something else to look at: did the drives get reported in the CIM data? Does the manual run show data that is not in the database? That is another possibility. Was more than 2 MB of data returned? LabTech has a 2 MB limit on returned data. This would be very hard to reach with CIM data, but a huge cluster with hundreds of drives being reported could reach the 2 MB limit.
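One way to rule out the size limit is to save the output of a manual probe run and measure it. The Python sketch below is only an illustration: the function name is hypothetical, and it assumes "2 MB" means 2 * 1024 * 1024 bytes and that output exactly at the limit is still accepted, neither of which is confirmed here.

```python
from typing import List

# Assumption: "2 MB" is read as 2 * 1024 * 1024 bytes.
MAX_RETURN_BYTES = 2 * 1024 * 1024


def check_probe_output(raw: bytes) -> List[str]:
    """Return a list of problems found in saved probe output."""
    problems = []
    if not raw.strip():
        # The command returned nothing, so nothing can reach the database.
        problems.append("empty output")
    if len(raw) > MAX_RETURN_BYTES:
        # Output this large may be cut off before it is stored.
        problems.append("output exceeds 2 MB limit")
    return problems
```

For example, reading the saved output with `open(path, "rb").read()` and passing it to `check_probe_output` returns an empty list when the output is non-empty and under the assumed limit.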

TPriest@rocketit.com
Posts: 7
Joined: Tue Jul 17, 2018 12:53 pm

Re: Esxi Health Monitor: Host has bad drive, Monitor did not alert

Post by TPriest@rocketit.com »

Good Evening Cubert,


The drive is indeed green-lighted. I had to replace it immediately, as it had been in a failure state for close to a month. I couldn't take the risk much longer, sorry! I used the linked test command and received a lot of feedback about an Unknown CIM error. Attached is the screenshot for review. Let me know if you wish for me to continue with your instructions. Thanks for your help!
Attachments
ScreenConnect.WindowsClient_2018-12-31_13-58-34.png (158.36 KiB)

Cubert
Posts: 1483
Joined: Tue Dec 29, 2015 7:57 pm

Re: Esxi Health Monitor: Host has bad drive, Monitor did not alert

Post by Cubert »

OK, what version of ESX are you on, and has this ESX server been rebooted recently?

We see fairly often that ESX stops responding to CIM requests or dumps no data, and it turns out something is amiss on the ESX host. A fresh reboot tends to resolve the issue.

TPriest@rocketit.com
Posts: 7
Joined: Tue Jul 17, 2018 12:53 pm

Re: Esxi Health Monitor: Host has bad drive, Monitor did not alert

Post by TPriest@rocketit.com »

We are on version 6.0 of ESX, and the host has been up for 456 days.

I would assume this is a little too long for a host?

Cubert
Posts: 1483
Joined: Tue Dec 29, 2015 7:57 pm

Re: Esxi Health Monitor: Host has bad drive, Monitor did not alert

Post by Cubert »

Sounds like a great opportunity to test the theory!

TPriest@rocketit.com
Posts: 7
Joined: Tue Jul 17, 2018 12:53 pm

Re: Esxi Health Monitor: Host has bad drive, Monitor did not alert

Post by TPriest@rocketit.com »

Are there any alternatives I could try besides rebooting this host? I feel like rebooting should be a last-ditch effort since the host has been recently rebooted. I couldn't afford to reboot every host once a year if this is the resolution. The host is only two years old.

Cubert
Posts: 1483
Joined: Tue Dec 29, 2015 7:57 pm

Re: Esxi Health Monitor: Host has bad drive, Monitor did not alert

Post by Cubert »

ESX software sometimes just needs a refresh. Our plugin cannot do anything like that to affect the host.

You might try the ESX support team; they might be able to fire off a restart of services that will affect CIM without a reboot.

Rebooting a host with lots of business-critical applications on it is not desirable, but may be necessary in the end.

TPriest@rocketit.com
Posts: 7
Joined: Tue Jul 17, 2018 12:53 pm

Re: Esxi Health Monitor: Host has bad drive, Monitor did not alert

Post by TPriest@rocketit.com »

Would restarting the CIM services have the same effect? I would also assume these logs are stored on the host itself?

Cubert
Posts: 1483
Joined: Tue Dec 29, 2015 7:57 pm

Re: Esxi Health Monitor: Host has bad drive, Monitor did not alert

Post by Cubert »

It might. I have had mixed results with service restarts.
