Once in a while, a disaster happens and when it happens, the first question will be which servers were affected?
Unfortunately, the answer to this question is not accurate. You are probably thinking that I am wrong. Since the “Failed to connect to computer” alert will pop up when the agent will stop sending a heartbeat for 3 seconds (the default value) and no ping answer from the Management Server. This is a great monitor, however, the issue comes up when the agent has some kind of a problem. Let’s say the agent had a problem before the disaster such as service being shut down or data being corrupted or even somehow the agent is deleted from the server. In these scenarios, the “Failed to connect to computer” alert will not be pop up. I’m sure many of the SCOM administrators are familiar with the problem.
This problem motivated me to create a new management pack.
My solution is based on SQL query and ping check from a management server.
The following steps will explain to you how to solve the problem:
First, get all the unhealthy agents that are not in maintenance mode. The reason for this is that the unhealthy agents are the only agents that could be affected by the disaster. Second, check the ping status from agent’s management server (Note: The management server itself can be shut down, therefore, if necessary we will check the ping from the secondary management server and if needed from the RMS server). The output will be stored as an event in the Operations Manager database. The third step is to run a simple SQL query that will return a list of servers that have been affected as a result of the disaster (I added a sample query at the end of this page).
While I was working on this management pack I had another idea, which was to repair the unhealthy agent automatically. Let me explain since we are already collecting unhealthy agents that answer to ping, we can easily reach the conclusion, that the problem is not a server. Therefore, trying to repair the agent is necessary. So to repair the agent, all we will do is run a task of restarting the agent. For the agents that have a problem with collecting performance data, we will run the flush agent task.
Link to the SCOM Administration Add-Ons.
SQL Query Example:
DECLARE @StartTime DateTime DECLARE @EndTime DateTime SET @StartTime =DATEADD(MINUTE, ((DATEPART(MINUTE, GETUTCDATE()) / 5) * 5)-30, DATEADD(HOUR, DATEDIFF(HOUR, 0, GETUTCDATE()), 0)) SET @EndTime = DATEADD(MINUTE, ((DATEPART(MINUTE, GETUTCDATE()) / 5) * 5), DATEADD(HOUR, DATEDIFF(HOUR, 0, GETUTCDATE()), 0)) SELECT v.TimeGenerated ,DATEADD(MINUTE, ((DATEPART(MINUTE, v.TimeGenerated) / 5) * 5), DATEADD(HOUR, DATEDIFF(HOUR, 0, v.TimeGenerated), 0)) AS TimeGeneratedFixed ,EventParametersXML ,y.Status ,y.StatusCode ,y.ResponseTime ,y.AgentServerName ,y.ManagementServerName INTO #EventAllView FROM EventAllView v OUTER APPLY (SELECT CAST(v.EventParameters AS XML) AS EventParametersXML) x OUTER APPLY (SELECT x.value('Param[1]', 'VARCHAR(80)') AS Status ,x.value('Param[2]', 'VARCHAR(80)') AS StatusCode ,x.value('Param[3]', 'VARCHAR(80)') AS ResponseTime ,x.value('Param[4]', 'VARCHAR(80)') AS AgentServerName ,x.value('Param[5]', 'VARCHAR(80)') AS ManagementServerName FROM x.EventParametersXML.nodes('/') AS NodeValues (x)) y WHERE PublisherName = 'ServerConnectivityCheck' AND TimeGenerated >= @StartTime AND DATEADD(MINUTE, DATEDIFF(MINUTE, 0, TimeGenerated), 0) <= @EndTime ;WITH TimesCTE ([Date]) AS ( SELECT @StartTime UNION ALL SELECT DATEADD(MINUTE, 5, [Date]) FROM TimesCTE WHERE [Date] < @EndTime ) SELECT res.AgentServerName, res.[Date], v.TimeGenerated, v.Status, v.StatusCode, v.ResponseTime, v.ManagementServerName into #tmpPivot FROM ( SELECT srv.AgentServerName, c.[Date] FROM ( SELECT DISTINCT AgentServerName FROM #EventAllView ) srv CROSS APPLY ( SELECT [Date] FROM TimesCTE ) c ) res JOIN #EventAllView v ON DATEADD(MINUTE, ((DATEPART(MINUTE, v.TimeGenerated) / 5) * 5), DATEADD(HOUR, DATEDIFF(HOUR, 0, v.TimeGenerated), 0)) = res.[Date] AND v.AgentServerName = res.AgentServerName ORDER BY res.AgentServerName, res.[Date] DECLARE @CMD VARCHAR(MAX) SET @CMD = 'SELECT * FROM (SELECT AgentServerName, StatusCode, CONVERT(VARCHAR(80), DATEADD(HOUR, (DATEDIFF(HOUR, GETUTCDATE(), GETDATE())), [Date]), 108) AS [Date] FROM #tmpPivot) AS SourceTable PIVOT (MAX(StatusCode) FOR [Date] IN (' SET @CMD = @CMD + ( SELECT LEFT(txt.grouped, LEN(txt.grouped) - 1) FROM ( SELECT '[' + CONVERT(VARCHAR(80), DATEADD(HOUR, (DATEDIFF(HOUR, GETUTCDATE(), GETDATE())), [Date]), 108) + '],' AS [text()] FROM #tmpPivot GROUP BY CONVERT(VARCHAR(80), DATEADD(HOUR, (DATEDIFF(HOUR, GETUTCDATE(), GETDATE())), [Date]), 108) ORDER BY CONVERT(VARCHAR(80), DATEADD(HOUR, (DATEDIFF(HOUR, GETUTCDATE(), GETDATE())), [Date]), 108) FOR XML PATH('') ) txt(grouped)) SET @CMD = @CMD + ')) AS PivotTable ORDER BY AgentServerName' EXEC (@CMD) DROP TABLE #EventAllView DROP TABLE #tmpPivot