Here is another article ported from my old blog dating back to 2012. I recently revised a few key points. Once again, these observations haven’t changed significantly over time.
Deploying SCOM Agents using the Discovery Wizard appears to have at least four distinct phases: setup, discovery, installation, and initialization. Each stage has its own requirements and its own failure modes, so treating the discovery process as a sequence of separate stages makes troubleshooting much easier.
Setup Stage
This is where you select a management server, set up AD integration or enter specific server names, supply execution credentials, and specify the install directory.
Common mistakes during setup:
- Targeting the wrong management server. Your environment may have specific management servers designated for agent management. You can manually move the agents afterwards if they are ‘remotely managed’.
- Specifying incorrect host names or failing to use FQDN names when listing target devices
- Specifying the wrong installation directory
Ways to avoid setup issues:
- Target the correct management server to avoid cleanup afterwards
- Check your “Agent Managed” and “Pending Actions” lists for the targeted servers (lookup will fail if already managed)
- Use FQDN names and the “Browse” option to validate the server lookup in AD when manually entering server names
- If you are using an install directory other than the default, make sure to set it correctly and consistently (a directory that cannot be found will not produce an error until a later stage)
Note: The discovery process needs to return exactly one result from the AD lookup for each listed server. Using FQDN names helps avoid returning multiple results. Issues sometimes arise when two servers have similar names. For example, a query for MAIL02 could return two results if both MAIL02 and MAIL022 are in the directory. SCOM won’t tell you when this happens. It will simply fail the lookup.
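The naming collision described in the note can be checked before you start the wizard. Below is a minimal Python sketch, assuming the lookup behaves like a prefix match (which is what the MAIL02/MAIL022 example suggests, not something SCOM documents). The function names and the `contoso.local` host names are hypothetical and only for illustration.

```python
def is_fqdn(name):
    """Rough check: at least two non-empty, dot-separated labels."""
    labels = name.rstrip(".").split(".")
    return len(labels) >= 2 and all(labels)


def ambiguous_targets(targets, directory_names):
    """Return the targets that would match more than one directory entry.

    Prefix matching is an assumption modelled on the MAIL02/MAIL022 example.
    """
    ambiguous = {}
    for target in targets:
        needle = target.lower()
        matches = [n for n in directory_names if n.lower().startswith(needle)]
        if len(matches) > 1:
            ambiguous[target] = matches
    return ambiguous


# Hypothetical directory contents, for illustration only.
directory = [
    "mail02.contoso.local",
    "mail022.contoso.local",
    "web01.contoso.local",
]

# The short name collides; the FQDN does not.
short_name_hits = ambiguous_targets(["MAIL02"], directory)
fqdn_hits = ambiguous_targets(["mail02.contoso.local"], directory)
```

Running the target list through a check like this first makes it obvious which entries should be replaced with FQDN names.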
Discovery Stage
This is where SCOM checks that all of the components needed for a remote installation are available. This appears to include AD lookup, DNS verification, and testing of several communication ports, services, and other components required by the setup process. At a minimum, discovery needs an AD lookup that returns a single result, DNS resolution, WMI, TCP 135 (RPC), TCP/UDP 445 (SMB), and TCP 5723. The target device clearly needs to be online and accessible. Security permissions on the device also play a part in successful discovery. The important thing to understand is that the discovery stage does not appear to be logged and does not return a specific error. It either works or it doesn’t.
Common issues during the Discovery Stage:
- Discovery fails to return results or only returns partial results
Ways to avoid lookup issues:
- Make sure that the remote device is online and accessible
- Try changing management servers on the second attempt if possible (this works more often than you would expect)
- Check the target device logs for signs of WMI issues, verify core services are up, and verify action account permissions
- The target device or management server may need to be rebooted
- If a firewall is present verify communication over TCP 135 (RPC), TCP/UDP 445 (SMB), and TCP 5723
- The target server should be on the same domain unless you are using certificate based authentication
- On Windows servers try temporarily turning off the Windows firewall
- Make sure your management servers are healthy
Note: I have seen several posts stating that certain services need to be running, though this has rarely proven to be relevant. The Windows Update service is commonly mentioned, but it is not required. The core services needed are usually integral to the OS and don’t need to be modified. What you want to watch out for are chronic issues on the server, such as WMI problems or other core service failures, which will be glaringly obvious in the event logs.
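Because discovery does not log anything, it helps to test the network requirements yourself. Here is a minimal sketch in Python that attempts a TCP connection to the three ports listed above. It only covers TCP (not UDP 445), it does not test WMI or permissions, and the host name in the usage comment is hypothetical. Run it from the machine whose outbound path you want to verify.

```python
import socket

# TCP ports named in this article; the labels are only for readability.
DISCOVERY_TCP_PORTS = {135: "RPC", 445: "SMB", 5723: "SCOM"}


def check_tcp_ports(host, ports=DISCOVERY_TCP_PORTS, timeout=3.0):
    """Return {port: True/False} depending on whether a TCP connect succeeds."""
    results = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts, and DNS resolution failures.
            results[port] = False
    return results


# Example usage (hypothetical host name):
# for port, is_open in check_tcp_ports("server01.contoso.local").items():
#     print(port, DISCOVERY_TCP_PORTS[port], "open" if is_open else "blocked")
```

A successful connect only shows that the port is reachable. A failed discovery with all ports open points back to AD lookup, WMI, or permissions.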
Installation Stage
Once you get past the discovery stage, things start to become clearer. The main issue with discovery troubleshooting is a lack of logging. Once the installation begins, an MSI installation log is created on the management server selected to deploy the agent. Look in *\SystemCenter Operations Manager\AgentManagement\AgentLogs on the deploying management server for the install logs. Simply search these logs for “Return value 3” to find the cause of the installation failure. The MSI logs can be confusing, but when you drill down to the “Return value 3” entry the cause is usually very clearly stated. You may also see an Operations Manager event log on the target Windows device if the install was partially successful.
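The “Return value 3” search is easy to script when you have many logs to go through. The Python sketch below scans a log folder and returns each hit together with the lines immediately before it, which is where the cause usually appears. It assumes the logs end in .log and are either UTF-16 with a byte order mark (common for MSI logs) or UTF-8. The function names are my own, and you need to pass in the AgentLogs folder path yourself.

```python
from pathlib import Path

MARKER = "Return value 3"


def read_log_text(path):
    """Decode a log file, treating a byte order mark as a sign of UTF-16."""
    data = Path(path).read_bytes()
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        return data.decode("utf-16", errors="replace")
    return data.decode("utf-8", errors="replace")


def find_failures(path, context=10):
    """Return (line_number, lines) for each marker hit, with preceding context."""
    lines = read_log_text(path).splitlines()
    hits = []
    for i, line in enumerate(lines):
        if MARKER in line:
            start = max(0, i - context)
            hits.append((i + 1, lines[start:i + 1]))
    return hits


def scan_log_dir(log_dir, pattern="*.log", context=10):
    """Scan every matching file in log_dir; keep only files with hits."""
    results = {}
    for log_path in sorted(Path(log_dir).glob(pattern)):
        hits = find_failures(log_path, context)
        if hits:
            results[log_path.name] = hits
    return results
```

Calling `scan_log_dir` with the AgentLogs folder gives a dictionary keyed by file name, so you can print the context lines for only the installs that failed.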
Common issues during installation:
- Incorrect installation directory provided during setup
- Insufficient security permissions to complete the installation
- Issues on the target device are preventing the install (locked files, WMI query failure, event log corruption, registry key locks, core service issues, etc.)
- Remnants from a previous version or failed install may still be present on the target server
Ways to avoid installation issues:
- Make sure to complete the discovery setup correctly
- Check the MSI installation logs on the deploying management server
- Check the System and Application logs on the target server for additional information
- Remove previous installs and reboot the target server prior to retrying the install
- Address any permissions issues found in the MSI logs if that is the cause (the install must be allowed to read from the Security event log for example)
- Try locally installing the agent. If remote deployment is required, at least run through the local setup steps to look for errors.
Note: “Microsoft.VC90” errors in the MSI logs or MSI Installer Event ID: 11935 in the Application log are indications that a reboot is needed to complete the installation.
Note: You need access to the remote device to troubleshoot installation failures; relying on information in the console is almost useless at this point. You may encounter situations where remote agent deployment fails on a server where you do not have logon access or remote log access.
Initialization Stage
This is the final stage where the agent completes the post-installation initialization process. Failing the initialization process will leave the successfully installed agent in a disconnected state. You will see this as a successful install that remains in Pending Actions or that stays in a grey state indefinitely.
The key here is to check the Operations Manager event log and track the initialization process, starting with Event ID: 7336. You may find it helpful to refer to the logs from a successful install for comparison. The basic initialization process includes identifying the RMS and Management Group, connecting to the MG, verifying the MG action account, an authorization key exchange, confirmation, and finally MP download.
Common issues during initialization:
- No RMS or Management Group found (more common with misconfigured manual install)
- Authentication failure
- Agent is rejected by SCOM for some reason (most often for a reinstall of a previously joined device)
- The ‘certificate’ enrollment (authentication registry key exchange) fails for some reason
Ways to avoid initialization issues:
- Make sure that your management servers are healthy and readily available
- Uninstall failed agents and clear lingering registry keys prior to reinstall
- You may need to clear lingering discovery data by running Remove-SCOMDisabledClassInstance on any management server
- Check OpsMgr event logs on the target management server for unauthorized agent events as another indication of auto rejection
Note: SCOM authenticates agents over TCP/UDP 445 (SMB). An alternative to TCP 445 is to use a digital certificate. The initialization process checks for this digital certificate before the TCP 445 authentication steps. There is a second round of ‘certificate’ enrollment in the agent initialization, and it is one of the final steps in joining a Management Group. Rather than a traditional certificate exchange, this actually refers to a registry value often called the ‘RMS Encryption Key’. This key is required to join SCOM. As you can imagine, anything preventing the storage or retrieval of this registry key will cause agent initialization to fail. This key is validated on a regular basis and can be the cause of agent health issues even after successful initialization. If you suspect that this is the issue, fully remove the locally installed agent, run Remove-SCOMDisabledClassInstance, and try again.
I hope that my first blog post was helpful. SCOM discovery and remote installation failure can be frustrating and should be anticipated during any planned agent deployment. If you are unprepared to address these issues in advance it can be very problematic. When in doubt verify that all of your management servers are healthy, try multiple management servers, remove previous installs and lingering registry keys, reboot the target server when necessary, and use the event logs and MSI logs to nail down the root cause for lingering issues.